Synthetic genes for plant gums and other hydroxyproline-rich proteins

ABSTRACT

A new approach in the field of plant gums is described which presents a new solution to the production of hydroxyproline (Hyp)-rich glycoproteins (HRGPs), repetitive proline-rich proteins (RPRPs) and arabinogalactan-proteins (AGPs). The expression of synthetic genes designed from repetitive peptide sequences of such glycoproteins, including the peptide sequences of gum arabic glycoprotein (GAGP), is taught in host cells, including plant host cells.

FIELD OF THE INVENTION

The present invention relates generally to the field of plant gums and other hydroxyproline-rich glycoproteins, and in particular, to the expression of synthetic genes designed from repetitive peptide sequences.

BACKGROUND

Gummosis is a common wound response that results in the exudation of a gum sealant at the site of cracks in bark. A. M. Stephen et al., “Exudate Gums”, Methods Plant Biochem. (1990). Generally the exudate is a composite of polysaccharides and glycoproteins structurally related to cell wall components such as galactans [G. O. Aspinall, “Plant Gums”, The Carbohydrates 2B:522536 (1970)] and hydroxyproline-rich glycoproteins [Anderson and McDougall, “The chemical characterization of the gum exudates from eight Australian Acacia species of the series Phyllodineae.” Food Hydrocolloids, 2: 329 (1988)].

Gum arabic is probably the best characterized of these exudates (although it has been largely refractory to chemical analysis). It is a natural plant exudate secreted by various species of Acacia trees. Acacia senegal accounts for approximately 80% of the production of gum arabic, with Acacia seyal, Acacia laeta, Acacia campylacantha, and Acacia drepanolobium supplying the remaining 20%. The gum is gathered by hand in Africa. It is a tedious process involving piercing and stripping the bark of the trees, then returning later to gather the dried, teardrop-shaped, spherical balls that form in response to mechanical wounding.

The exact chemical nature of gum arabic has not been elucidated. It is believed to consist of two major components, a microheterogeneous glucurono-arabinorhamnogalactan polysaccharide and a higher molecular weight hydroxyproline-rich glycoprotein. Osman et al., “Characterization of Gum Arabic Fractions Obtained By Anion-Exchange Chromatography” Phytochemistry 38:409 (1984) and Qi et al., “Gum Arabic Glycoprotein Is A Twisted Hairy Rope” Plant Physiol. 96:848 (1991). While the amino acid composition of the protein portion has been examined, little is known with regard to the precise amino acid sequence.

While the precise chemical nature of gum arabic is elusive, the gum is nonetheless particularly useful due to its high solubility and low viscosity compared to other gums. The FDA declared the gum to be a GRAS food additive. Consequently, it is widely used in the food industry as a thickener, emulsifier, stabilizer, surfactant, protective colloid, and flavor fixative or preservative. J. Dziezak, “A Focus on Gums” Food Technology (March 1991). It is also used extensively in the cosmetics industry.

Normally, the world production of gum arabic is over 100,000 tons per year. However, this production depends on the environmental and political stability of the region producing the gum. In the early 1970s, for example, a severe drought reduced gum production to 30,000 tons. Again in 1985, drought brought about shortages of the gum, resulting in a 600% price increase.

Three approaches have been used to deal with the somewhat precarious supply problem of gum arabic. First, other gums have been sought out in other regions of the world. Second, additives have been investigated to supplement inferior gum arabic. Third, production has been investigated in cultured cells.

The effort to find other gums in other regions of the world has met with some limited success. However, the solubility of gum arabic from Acacia is superior to other gums because it dissolves well in either hot or cold water. Moreover, while other exudates are limited to a 5% solution because of their excessive viscosity, gum arabic can be dissolved readily to make 55% solutions.

Some additives have been identified to supplement gum arabic. For example, whey proteins can be used to increase the functionality of gum arabic. A. Prakash et al., “The effects of added proteins on the functionality of gum arabic in soft drink emulsion systems,” Food Hydrocolloids 4:177 (1990). However, this approach has limitations. Only low concentrations of such additives can be used without producing off-flavors in the final food product.

Attempts to produce gum arabic in cultured Acacia senegal cells have been explored. Unfortunately, conditions have not been found which lead to the expression of gum arabic in culture. A. Mollard and J-P. Joseleau, “Acacia senegal cells cultured in suspension secrete a hydroxyproline-deficient arabinogalactan-protein” Plant Physiol. Biochem. 32:703 (1994).

Clearly, new approaches to improve gum arabic production are needed. Such approaches should not be dependent on environmental or political factors. Ideally, such approaches should simplify production and be relatively inexpensive.

SUMMARY OF THE INVENTION

The present invention involves a new approach in the field of plant gums and presents a new solution to the production of hydroxyproline (Hyp)-rich glycoproteins (HRGPs), repetitive proline-rich proteins (RPRPs) and arabinogalactan-proteins (AGPs). The present invention contemplates the expression of synthetic genes designed from repetitive peptide sequences of such glycoproteins, including the peptide sequences of gum arabic glycoprotein (GAGP).

With respect to GAGP, the present invention contemplates a substantially purified polypeptide comprising at least a portion of the amino acid sequence Ser-Hyp-Hyp-Hyp-[Hyp/Thr]-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:1 and SEQ ID NO:2) or variants thereof. By “variants” it is meant that the sequence need not comprise the exact sequence; up to five (5) amino acid substitutions are contemplated. For example, a Leu or Hyp may be substituted for the Gly; Leu may also be substituted for Ser and one or more Hyp. By “variants” it is also meant that the sequence need not be the entire nineteen (19) amino acids. Illustrative variants are shown in Table 3. In one preferred embodiment, variants contain one or more of the following three motifs: Ser-Hyp₄, Ser-Hyp₃-Thr, and Xaa-Hyp-Xaa-Hyp, where Xaa is any amino acid other than hydroxyproline.
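
By way of illustration only, the three motifs named above can be located in a peptide computationally. The following Python sketch assumes that peptides are written as hyphen-separated three-letter residue codes; the names parse_peptide, matches_at and find_motifs are hypothetical helpers introduced here and form no part of the disclosed sequences or methods.

```python
from typing import Dict, List, Optional

# Motif patterns as residue lists. None stands for "Xaa",
# i.e. any residue other than hydroxyproline (Hyp).
MOTIFS: Dict[str, List[Optional[str]]] = {
    "Ser-Hyp4": ["Ser", "Hyp", "Hyp", "Hyp", "Hyp"],
    "Ser-Hyp3-Thr": ["Ser", "Hyp", "Hyp", "Hyp", "Thr"],
    "Xaa-Hyp-Xaa-Hyp": [None, "Hyp", None, "Hyp"],
}


def parse_peptide(text: str) -> List[str]:
    """Split 'Ser-Hyp-Hyp' style text into a list of residue codes."""
    return [residue.strip() for residue in text.split("-") if residue.strip()]


def matches_at(residues: List[str], pattern: List[Optional[str]], start: int) -> bool:
    """Return True if the pattern matches the peptide beginning at index start."""
    if start + len(pattern) > len(residues):
        return False
    for offset, expected in enumerate(pattern):
        actual = residues[start + offset]
        if expected is None:
            # Xaa position: anything except Hyp is accepted.
            if actual == "Hyp":
                return False
        elif actual != expected:
            return False
    return True


def find_motifs(residues: List[str]) -> Dict[str, List[int]]:
    """Map each motif name to the zero-based start positions where it occurs."""
    return {
        name: [i for i in range(len(residues)) if matches_at(residues, pattern, i)]
        for name, pattern in MOTIFS.items()
    }
```

Applied to the nineteen-residue sequence with Hyp at the fifth position, such a scan would be expected to report one Ser-Hyp₄ occurrence at the amino terminus and overlapping Xaa-Hyp-Xaa-Hyp occurrences in the Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp region.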

Indeed, it is not intended that the present invention be limited by the precise length of the purified polypeptide. In one embodiment, the peptide comprises more than twelve (12) amino acids from the nineteen (19) amino acids of the sequence. In another embodiment, a portion of the nineteen (19) amino acids (see SEQ ID NO:1 and SEQ ID NO:2) is utilized as a repetitive sequence. In yet another embodiment, all nineteen (19) amino acids (see SEQ ID NO:1 and SEQ ID NO:2), with or without amino acid substitutions, are utilized as a repetitive sequence.

It is not intended that the present invention be limited by the precise number of repeats. The sequence (i.e. SEQ ID NO:1 and SEQ ID NO:2) or variants thereof may be used as a repeating sequence between one (1) and up to fifty (50) times, more preferably between ten (10) and up to thirty (30) times, and most preferably approximately twenty (20) times. The sequence (i.e. SEQ ID NO:1 and SEQ ID NO:2) or variants thereof may be used as contiguous repeats or may be used as non-contiguous repeats (with other amino acids, or amino acid analogues, placed between the repeating sequences).
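
The distinction between contiguous and non-contiguous repeats can be pictured with a short Python sketch. It assumes residues are held as lists of three-letter codes; the function build_repeats and the single-glycine spacer used in the usage note are hypothetical illustrations, not sequences taken from this disclosure.

```python
from typing import List, Optional

# The nineteen-residue repeat unit in its Hyp form (fifth residue Hyp).
GAGP_UNIT = (
    "Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-"
    "Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His"
).split("-")


def build_repeats(
    unit: List[str], copies: int, spacer: Optional[List[str]] = None
) -> List[str]:
    """Concatenate `copies` of `unit`.

    With no spacer the repeats are contiguous; with a spacer, the spacer
    residues are placed between (not before or after) successive repeats.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    polypeptide: List[str] = []
    for index in range(copies):
        if index > 0 and spacer:
            polypeptide.extend(spacer)
        polypeptide.extend(unit)
    return polypeptide
```

For example, build_repeats(GAGP_UNIT, 20) would give a contiguous design of 380 residues, while build_repeats(GAGP_UNIT, 3, spacer=["Gly"]) would give three repeats separated by single, purely illustrative, glycine residues.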

The present invention specifically contemplates fusion proteins comprising a non-gum arabic protein or glycoprotein sequence and a portion of the gum arabic glycoprotein sequence (SEQ ID NO:1 and SEQ ID NO:2). It is not intended that the present invention be limited by the nature of the non-gum arabic glycoprotein sequence. In one embodiment, the non-gum arabic glycoprotein sequence is a green fluorescent protein.

As noted above, the present invention contemplates synthetic genes encoding such peptides. By “synthetic genes” it is meant that the nucleic acid sequence is derived using the peptide sequence of interest (in contrast to using the nucleic acid sequence from cDNA). In one embodiment, the present invention contemplates an isolated polynucleotide sequence encoding a polypeptide comprising at least a portion of the polypeptide of SEQ ID NO:1 and SEQ ID NO:2 or variants thereof. The present invention specifically contemplates a polynucleotide sequence comprising a nucleotide sequence encoding a polypeptide comprising one or more repeats of SEQ ID NO:1 and SEQ ID NO:2 or variants thereof. Importantly, it is not intended that the present invention be limited to the precise nucleic acid sequence encoding the polypeptide of interest.

The present invention contemplates synthetic genes encoding portions of HRGPs, wherein the encoded peptides contain one or more of the highly conserved Ser-Hyp₄ (SEQ ID NO:3) motif(s). The present invention also contemplates synthetic genes encoding portions of RPRPs, wherein the encoded peptides contain one or more of the pentapeptide motif: Pro-Hyp-Val-Tyr-Lys (SEQ ID NO:4) and variants of this sequence such as X-Hyp-Val-Tyr-Lys (SEQ ID NO:5) and Pro-Hyp-Val-X-Lys (SEQ ID NO:6) and Pro-Pro-X-Tyr-Lys and Pro-Pro-X-Tyr-X (SEQ ID NO:8), where “X” can be Thr, Glu, Hyp, Pro, His and Ile. The present invention also contemplates synthetic genes encoding portions of AGPs, wherein the encoded peptides contain one or more Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9) repeats. Such peptides can be expressed in a variety of forms, including but not limited to fusion proteins.

With regard to motifs for HRGPs, the present invention contemplates a polynucleotide sequence comprising the sequence: 5′-CCA CCA CCT TCA CCT CCA CCC CCA TCT CCA-3′ (SEQ ID NO:10). With regard to motifs for AGPs, the present invention contemplates a polynucleotide sequence comprising the sequence: 5′-TCA CCA TCA CCA TCT CCT TCG CCA TCA CCC-3′ (SEQ ID NO:11). Of course, it is not intended that the present invention be limited by the particular sequence. Indeed, the present invention specifically contemplates sequences that are not identical but are nonetheless homologous to the sequences of SEQ ID NOS: 10 and 11. The present invention also contemplates sequences that are complementary (including sequences that are only partially complementary) to the sequences of SEQ ID NOS: 10 and 11. Such complementary sequences include sequences that will hybridize to the sequences of SEQ ID NOS: 10 and 11 under low stringency conditions as well as high stringency conditions (see Definitions below).
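
As an illustrative check of the two polynucleotide motifs above, the following Python sketch translates each sequence with the standard genetic code and computes a fully complementary strand. The codon table is deliberately restricted to the proline and serine codons that appear in SEQ ID NOS: 10 and 11, and the function names are hypothetical. The translation yields Pro rather than Hyp because hydroxyproline arises by post-translational hydroxylation of proline in the host cell.

```python
from typing import List

# Standard genetic code, limited to the TCN (Ser) and CCN (Pro) codons.
CODON_TABLE = {
    "CCA": "Pro", "CCC": "Pro", "CCG": "Pro", "CCT": "Pro",
    "TCA": "Ser", "TCC": "Ser", "TCG": "Ser", "TCT": "Ser",
}

SEQ_ID_NO_10 = "CCA CCA CCT TCA CCT CCA CCC CCA TCT CCA"  # HRGP motif
SEQ_ID_NO_11 = "TCA CCA TCA CCA TCT CCT TCG CCA TCA CCC"  # AGP motif


def translate(dna: str) -> List[str]:
    """Translate a coding-strand sequence, read 5' to 3', into residue codes.

    Raises KeyError for any codon outside the restricted table above.
    """
    bases = dna.replace(" ", "").upper()
    if len(bases) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [CODON_TABLE[bases[i:i + 3]] for i in range(0, len(bases), 3)]


def reverse_complement(dna: str) -> str:
    """Return the fully complementary strand, written 5' to 3'."""
    bases = dna.replace(" ", "").upper()
    return bases.translate(str.maketrans("ACGT", "TGCA"))[::-1]


if __name__ == "__main__":
    print("-".join(translate(SEQ_ID_NO_10)))
    print("-".join(translate(SEQ_ID_NO_11)))
    print(reverse_complement(SEQ_ID_NO_10))
```

Under these assumptions SEQ ID NO:10 reads Pro-Pro-Pro-Ser-Pro-Pro-Pro-Pro-Ser-Pro, which contains the Ser-Pro₄ precursor of the Ser-Hyp₄ motif, and SEQ ID NO:11 reads as five consecutive Ser-Pro dipeptides, the precursor of the Xaa-Hyp-Xaa-Hyp repeat.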

The present invention also contemplates the mixing of motifs (i.e. modules) which are not found in wild-type sequences. For example, one might add GAGP modules to extensin and RPRP crosslinking modules to AGP-like molecules.

The present invention contemplates using the polynucleotides of the present invention for expression of the polypeptides in vitro and in vivo. Therefore, the present invention contemplates polynucleotide sequences encoding two or more repeats of the sequence of SEQ ID NO:1 and SEQ ID NO:2 or variants thereof, wherein said polynucleotide sequence is contained on a recombinant expression vector. It is also contemplated that such vectors will be introduced into a variety of host cells, both eukaryotic and prokaryotic (e.g. bacteria such as E. coli).

In one embodiment, the vector further comprises a promoter. It is not intended that the present invention be limited to a particular promoter. Any promoter sequence which is capable of directing expression of an operably linked nucleic acid sequence encoding a portion of a plant gum polypeptide (or other hydroxyproline-rich polypeptide of interest as described above) is contemplated to be within the scope of the invention. Promoters include, but are not limited to, promoter sequences of bacterial, viral and plant origins. Promoters of bacterial origin include, but are not limited to, the octopine synthase promoter, the nopaline synthase promoter and other promoters derived from native Ti plasmids. Viral promoters include, but are not limited to, the 35S and 19S RNA promoters of cauliflower mosaic virus (CaMV), and T-DNA promoters from Agrobacterium. Plant promoters include, but are not limited to, the ribulose-1,5-bisphosphate carboxylase small subunit promoter, maize ubiquitin promoters, the phaseolin promoter, the E8 promoter, and the Tob7 promoter.

The invention is not limited to the number of promoters used to control expression of a nucleic acid sequence of interest. Any number of promoters may be used so long as expression of the nucleic acid sequence of interest is controlled in a desired manner. Furthermore, the selection of a promoter may be governed by the desirability that expression be over the whole plant, or localized to selected tissues of the plant, e.g., root, leaves, fruit, etc. For example, promoters active in flowers are known (Benfey et al. (1990) Plant Cell 2:849-856).

The promoter activity of any nucleic acid sequence in host cells may be determined (i.e., measured or assessed) using methods well known in the art and exemplified herein. For example, a candidate promoter sequence may be tested by ligating it in-frame to a reporter gene sequence to generate a reporter construct, introducing the reporter construct into host cells (e.g. tomato or potato cells) using methods described herein, and detecting the expression of the reporter gene (e.g., detecting the presence of encoded mRNA or encoded protein, or the activity of a protein encoded by the reporter gene). The reporter gene may confer antibiotic or herbicide resistance. Examples of reporter genes include, but are not limited to, dhfr, which confers resistance to methotrexate [Wigler M et al., (1980) Proc Natl Acad Sci 77:3567-70]; npt, which confers resistance to the aminoglycosides neomycin and G-418 [Colbere-Garapin F et al., (1981) J. Mol. Biol. 150:1-14]; and als or pat, which confer resistance to chlorsulfuron and phosphinothricin, respectively (pat encodes phosphinothricin acetyltransferase). Recently, the use of a reporter gene system which expresses visible markers has gained popularity, with such markers as β-glucuronidase and its substrate (X-Gluc), luciferase and its substrate (luciferin), and β-galactosidase and its substrate (X-Gal) being widely used not only to identify transformants, but also to quantify the amount of transient or stable protein expression attributable to a specific vector system [Rhodes C A et al. (1995) Methods Mol Biol 55:121-131].

In addition to a promoter sequence, the expression construct preferably contains a transcription termination sequence downstream of the nucleic acid sequence of interest to provide for efficient termination. In one embodiment, the termination sequence is the nopaline synthase (NOS) sequence. In another embodiment the termination region comprises different fragments of the sugarcane ribulose-1,5-bisphosphate carboxylase/oxygenase (rubisco) small subunit (scrbcs) gene. The termination sequences of the expression constructs are not critical to the invention. The termination sequence may be obtained from the same gene as the promoter sequence or may be obtained from different genes.

If the mRNA encoded by the nucleic acid sequence of interest is to be efficiently translated, polyadenylation sequences are also commonly added to the expression construct. Examples of the polyadenylation sequences include, but are not limited to, the Agrobacterium octopine synthase signal, or the nopaline synthase signal.

The invention is not limited to constructs which express a single nucleic acid sequence of interest. Constructs which contain a plurality of (i.e., two or more) nucleic acid sequences under the transcriptional control of the same promoter sequence are expressly contemplated to be within the scope of the invention. Also included within the scope of this invention are constructs which contain the same or different nucleic acid sequences under the transcriptional control of different promoters. Such constructs may be desirable to, for example, target expression of the same or different nucleic acid sequences of interest to selected plant tissues.

As noted above, the present invention contemplates using the polynucleotides of the present invention for expression of a portion of plant gum polypeptides in vitro and in vivo. Where expression takes place in vivo, the present invention contemplates transgenic plants. The transgenic plants of the invention are not limited to plants in which each and every cell expresses the nucleic acid sequence of interest. Included within the scope of this invention is any plant (e.g. tobacco, tomato, maize, algae, etc.) which contains at least one cell which expresses the nucleic acid sequence of interest. It is preferred, though not necessary, that the transgenic plant express the nucleic acid sequence of interest in more than one cell, and more preferably in one or more tissues. It is particularly preferred that expression be followed by proper glycosylation of the plant gum polypeptide fragment or variant thereof, such that the host cell produces functional (e.g. in terms of use in the food or cosmetic industry) plant gum polypeptide.

The fact that transformation of plant cells has taken place with the nucleic acid sequence of interest may be determined using any number of methods known in the art. Such methods include, but are not limited to, restriction mapping of genomic DNA, PCR analysis, DNA-DNA hybridization, DNA-RNA hybridization, and DNA sequence analysis.

Expressed polypeptides (or fragments thereof) can be immobilized (covalently or non-covalently) on solid supports or resins for use in isolating HRGP-binding molecules from a variety of sources (e.g. algae, plants, animals, microorganisms). Such polypeptides can also be used to make antibodies.

The invention further provides a substantially purified polypeptidecomprising at least a portion of the gum arabic consensus sequence. Inparticular, the invention provides a substantially purified polypeptidecomprising at least a portion of amino acid sequenceA-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136),wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp,Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected fromHyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F isselected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr,Ala, and Ile; His selected from Hyp, Pro, Leu, and Ile; I is selectedfrom Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K isselected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly,Leu, Ala, and Ile; and M is selected from His and Pro; and wherein theportion is greater than twelve contiguous amino acids of the amino acidsequence. In a preferred embodiment, the portion occurs in thepolypeptide as a repeating sequence. In a more preferred embodiment, therepeating sequence repeats from 1 to 64 times. In an alternativepreferred embodiment, A is Ser; B is selected from Hyp, and Leu; D isselected from Hyp, Ser, and Thr; E is Leu; F is Ser; G is selected fromSer, Leu, and Hyp; His selected from Hyp, Pro, and Leu; I is selectedfrom Thr and Ala; J is Thr; K is selected from Thr, Leu, and Hyp; L isselected from Gly and Leu; and M is selected from His and Pro. 
In another alternative embodiment, the amino acid sequence is selected from Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:143), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:144), Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Thr-Gly-Pro-His (SEQ ID NO:145), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-Hyp (SEQ ID NO:146), Ser-Hyp-Leu-Pro-Thr-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:147), Ser-Hyp-Leu-Pro-Thr-Leu-Ser-Hyp-Leu-Pro-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:148), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-Hyp (SEQ ID NO:149), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:150), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:151), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:152), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:153), Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:154), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Leu-Pro-His (SEQ ID NO:155), Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Leu-Pro (SEQ ID NO:156), Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu (SEQ ID NO:157), Hyp-Hyp-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:158), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp (SEQ ID NO:159), Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-Hyp (SEQ ID NO:160), Hyp-Thr-Leu-Ser-Hyp-Leu-Pro-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly (SEQ ID NO:161), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp (SEQ ID NO:162), Ser-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Thr (SEQ ID NO:163), Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp (SEQ ID NO:164), Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:165), Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp (SEQ ID NO:166), Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro (SEQ ID NO:167), Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:168), Hyp-Leu-Ser-Hyp-Ser-Hyp-Ala-Hyp (SEQ ID NO:169), Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser (SEQ ID NO:170), Thr-Hyp-Hyp-Hyp-Gly-Pro (SEQ ID NO:171), Hyp-Hyp-Leu-Ser-Hyp-Ser (SEQ ID NO:172), Ser-Hyp-Leu-Pro-Ala-Hyp (SEQ ID NO:173), Leu-Pro-Thr-Leu-Ser-Hyp (SEQ ID NO:174), Ser-Hyp-Ser-Hyp (SEQ ID NO:175), Ser-Hyp-Thr-Hyp (SEQ ID NO:176), Thr-Hyp-Thr-Hyp (SEQ ID NO:177), Thr-Hyp-Hyp-Hyp (SEQ ID NO:178), Ser-Hyp-Pro-Pro-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:217), Ser-Hyp-Hyp-Pro-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:218), Ser-Hyp-Pro-Hyp-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:219), Ser-Hyp-Pro-Pro-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:220), Ser-Hyp-Hyp-Hyp-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:221), Ser-Hyp-Hyp-Pro-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:222), Ser-Hyp-Pro-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:223), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:224), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:225), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His-Ser-Hyp-Hyp-Hyp-(Hyp) (SEQ ID NO:18), Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:23), Ser-Hyp-Hyp-Hyp-A-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-B-Gly-Pro-His (SEQ ID NO:179), where A is selected from Hyp, Thr, and Ser, and B is selected from Hyp and Lys, SEQ ID NO:131, and SEQ ID NO:133. In yet another alternative embodiment, the portion comprises a motif selected from (Xaa-Hyp)_(x) (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein Xaa is any amino acid other than hydroxyproline, and wherein x is from 2 to 1000.
In a preferred embodiment, the portion comprises the sequence Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9), and wherein Xaa is selected from Ser, Thr, and Ala. In a further alternative embodiment, the portion comprises a motif selected from Xaa-Hyp-Hyp_(n) (SEQ ID NO:209) and Xaa-Pro-Hyp_(n) (SEQ ID NO:210), wherein n is from 1 to 100, and wherein Xaa is any amino acid other than hydroxyproline. In a preferred embodiment, the portion comprises a peptide sequence selected from Ser-Hyp₂ (SEQ ID NO:211), Ser-Hyp₃ (SEQ ID NO:212), Ser-Hyp₄ (SEQ ID NO:3), Thr-Hyp₂ (SEQ ID NO:213), and Thr-Hyp₃ (SEQ ID NO:214). In an additional alternative embodiment, the portion comprises a peptide sequence selected from Ser-Hyp₂-Pro (SEQ ID NO:215) and Ser-Hyp₂-Pro-Hyp (SEQ ID NO:216).
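
The consensus sequence of SEQ ID NO:136 lends itself to a position-by-position check. The Python sketch below is offered only as an illustration: it encodes the nineteen positions as sets of permitted residues (the broadest sets recited for A through M) and tests whether a fragment matches a contiguous window of the consensus and exceeds twelve residues. The names CONSENSUS, matching_offsets and is_claimed_portion are hypothetical and are not terms used elsewhere in this disclosure.

```python
from typing import List, Set

# Permitted residues at each of the 19 positions of
# A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136).
CONSENSUS: List[Set[str]] = [
    {"Ser", "Thr", "Ala"},                       # A
    {"Hyp"},
    {"Hyp", "Pro", "Leu", "Ile"},                # B
    {"Pro", "Hyp"},                              # C
    {"Hyp", "Pro", "Ser", "Thr", "Ala"},         # D
    {"Leu", "Ile"},                              # E
    {"Ser", "Thr", "Ala"},                       # F
    {"Hyp"},
    {"Ser", "Leu", "Hyp", "Thr", "Ala", "Ile"},  # G
    {"Hyp", "Pro", "Leu", "Ile"},                # H
    {"Thr", "Ala", "Ser"},                       # I
    {"Hyp"},
    {"Thr", "Ser", "Ala"},                       # J
    {"Hyp"},
    {"Hyp"},
    {"Thr", "Leu", "Hyp", "Ser", "Ala", "Ile"},  # K
    {"Gly", "Leu", "Ala", "Ile"},                # L
    {"Pro"},
    {"His", "Pro"},                              # M
]


def parse_peptide(text: str) -> List[str]:
    """Split 'Ser-Hyp-Hyp' style text into a list of residue codes."""
    return [residue.strip() for residue in text.split("-") if residue.strip()]


def matching_offsets(fragment: List[str]) -> List[int]:
    """Return every consensus offset at which the whole fragment fits."""
    hits = []
    for start in range(len(CONSENSUS) - len(fragment) + 1):
        window = CONSENSUS[start:start + len(fragment)]
        if all(residue in allowed for residue, allowed in zip(fragment, window)):
            hits.append(start)
    return hits


def is_claimed_portion(fragment: List[str]) -> bool:
    """True if the fragment fits the consensus and exceeds twelve residues."""
    return len(fragment) > 12 and bool(matching_offsets(fragment))
```

In this sketch the full-length sequence of SEQ ID NO:143 fits the consensus at offset zero, whereas a short motif such as Ser-Hyp-Ser-Hyp fits at several offsets yet fails the length test.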

The invention further provides a substantially purified polypeptidecomprising a non-contiguous hydroxyproline motif. In particular, theinvention provides a substantially purified polypeptide comprising afirst motif selected from (Xaa-Hyp)_(x) (SEQ ID NO:182) andXaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein Xaa is any amino acidother than hydroxyproline, and wherein x is from 2 to 1000. In oneembodiment, the sequence is Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9), wherein Xaais selected from Ser, Thr, and Ala. In an alternative embodiment, thepolypeptide further comprises a contiguous hydroxyproline motif (i.e., asecond motif) selected from Xaa-Hyp-Hyp_(n) (SEQ ID NO:209) andXaa-Pro-Hyp_(n) (SEQ ID NO:210), wherein n is from 1 to 100, and whereinXaa is any amino acid other than hydroxyproline. In a preferredembodiment, the first and second motifs alternate in the polypeptide. Ina more preferred embodiment, the alternating first and second motifsrepeat from 1 to 500 times.
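
A design in which non-contiguous and contiguous hydroxyproline motifs alternate can be sketched as follows. This Python fragment rests on two stated assumptions: that (Xaa-Hyp)_(x) denotes x consecutive Xaa-Hyp dipeptides, and that Xaa-Hyp-Hyp_(n) denotes Xaa-Hyp followed by n further Hyp residues, a reading consistent with Ser-Hyp₂ corresponding to n equal to 1. The function names are hypothetical.

```python
from typing import List


def first_motif(xaa: str, x: int) -> List[str]:
    """Non-contiguous motif (Xaa-Hyp)_x, assumed to be x Xaa-Hyp dipeptides."""
    if xaa == "Hyp":
        raise ValueError("Xaa must be a residue other than Hyp")
    if not 2 <= x <= 1000:
        raise ValueError("x must be from 2 to 1000")
    return [xaa, "Hyp"] * x


def second_motif(xaa: str, n: int) -> List[str]:
    """Contiguous motif Xaa-Hyp-Hyp_n, assumed to be Xaa-Hyp plus n more Hyp."""
    if xaa == "Hyp":
        raise ValueError("Xaa must be a residue other than Hyp")
    if not 1 <= n <= 100:
        raise ValueError("n must be from 1 to 100")
    return [xaa, "Hyp"] + ["Hyp"] * n


def alternating(first: List[str], second: List[str], repeats: int) -> List[str]:
    """Alternate the first and second motifs `repeats` times (1 to 500)."""
    if not 1 <= repeats <= 500:
        raise ValueError("repeats must be from 1 to 500")
    return (first + second) * repeats
```

For instance, alternating(first_motif("Ser", 2), second_motif("Ser", 3), 2) would give Ser-Hyp-Ser-Hyp followed by Ser-Hyp₄, with that nine-residue block occurring twice.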

Also provided herein is a substantially purified polypeptide comprising a motif selected from Xaa-Hyp-Hyp_(n) (SEQ ID NO:209) and Xaa-Pro-Hyp_(n) (SEQ ID NO:210), wherein n is from 1 to 100, and wherein Xaa is any amino acid other than hydroxyproline. In one embodiment, the portion comprises a peptide sequence selected from Ser-Hyp₂ (SEQ ID NO:211), Ser-Hyp₃ (SEQ ID NO:212), Ser-Hyp₄ (SEQ ID NO:3), Thr-Hyp₂ (SEQ ID NO:213), and Thr-Hyp₃ (SEQ ID NO:214).

The invention also provides a fusion protein comprising a first sequence selected from a non-gum arabic protein sequence and a non-gum arabic glycoprotein sequence operably linked to at least a portion of an amino acid sequence selected from (a) A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136), wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro; and wherein the portion is greater than twelve contiguous amino acids of the amino acid sequence, (b) a polypeptide comprising a first motif selected from (Xaa-Hyp)_(x) (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein x is from 2 to 1000, (c) a polypeptide comprising a second motif selected from Xaa-Hyp-Hyp_(n) (SEQ ID NO:209) and Xaa-Pro-Hyp_(n) (SEQ ID NO:210), wherein n is from 1 to 500, and (d) a polypeptide comprising the first motif and the second motif, wherein Xaa is any amino acid other than hydroxyproline. In one embodiment, the first sequence is a green fluorescent protein amino acid sequence.

Also provided by the invention is an isolated polynucleotide sequence encoding at least a portion of an amino acid sequence selected from (a) A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136), wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro; and wherein the portion is greater than twelve contiguous amino acids of the amino acid sequence, (b) a polypeptide comprising a first motif selected from (Xaa-Hyp)_(x) (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein x is from 2 to 1000, (c) a polypeptide comprising a second motif selected from Xaa-Hyp-Hyp_(n) (SEQ ID NO:209) and Xaa-Pro-Hyp_(n) (SEQ ID NO:210), wherein n is from 1 to 500, and (d) a polypeptide comprising the first motif and the second motif, wherein Xaa is any amino acid other than hydroxyproline.

The invention further provides a recombinant expression vector comprising a polynucleotide sequence encoding a portion of an amino acid sequence selected from (a) A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136), wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro; and wherein the portion is greater than twelve contiguous amino acids of the amino acid sequence, (b) a polypeptide comprising a first motif selected from (Xaa-Hyp)_(x) (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein x is from 2 to 1000, (c) a polypeptide comprising a second motif selected from Xaa-Hyp-Hyp_(n) (SEQ ID NO:209) and Xaa-Pro-Hyp_(n) (SEQ ID NO:210), wherein n is from 1 to 500, and (d) a polypeptide comprising the first motif and the second motif, wherein Xaa is any amino acid other than hydroxyproline. In one embodiment, the expression vector further comprises a promoter operably linked to the polynucleotide sequence. In a preferred embodiment, the promoter is a viral promoter. In a more preferred embodiment, the viral promoter is selected from the group consisting of the 35S and 19S RNA promoters of cauliflower mosaic virus. In an alternative preferred embodiment, the expression vector further comprises a signal sequence selected from extensin signal sequence (SEQ ID NO:14), and tomato arabinogalactan-protein signal sequence (SEQ ID NO:215). In a more preferred embodiment, the expression vector further comprises a reporter gene. In a yet more preferred embodiment, the reporter gene is the green fluorescent protein gene. In another embodiment, the vector is contained within a host cell. In a preferred embodiment, the host cell is a plant cell. In a more preferred embodiment, the plant cell expresses a glycoprotein comprising the portion.

Also provided herein is a method for producing at least a portion of a glycoprotein, comprising: a) providing: i) a recombinant expression vector comprising a polynucleotide sequence encoding at least a portion of an amino acid sequence selected from (a) A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136), wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro; and wherein the portion is greater than twelve contiguous amino acids of the amino acid sequence, (b) a polypeptide comprising a first motif selected from (Xaa-Hyp)_(x) (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein x is from 2 to 1000, (c) a polypeptide comprising a second motif selected from Xaa-Hyp-Hyp_(n) (SEQ ID NO:209) and Xaa-Pro-Hyp_(n) (SEQ ID NO:210), wherein n is from 1 to 500, and (d) a polypeptide comprising the first motif and the second motif, wherein Xaa is any amino acid other than hydroxyproline; and ii) a host cell; and b) introducing the vector into the host cell under conditions such that the portion is expressed. In one embodiment, the host cell is growing in culture. In a preferred embodiment, the method further comprises the step of c) recovering the portion from the host cell culture. In an alternative embodiment, the host cell is a plant cell. In a more preferred embodiment, the plant cell is derived from a plant selected from the family Leguminosae.

DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the nucleic acid sequence (SEQ ID NO:12) of one embodiment of a synthetic gene of the present invention.

FIG. 2 shows one embodiment of a synthetic gene in one embodiment of an expression vector.

FIG. 3 is a graph showing size-fractionation of expressed protein from transformed tobacco cells.

FIG. 4 is a graph showing the isolation of GA-EGFP by reverse phase chromatography.

FIG. 5 is the elution profile for dGAGP by reverse phase chromatography on a Hamilton PRP-1 column and fractionation by gradient elution.

FIG. 6 is the elution profile for dGAGP incomplete pronase digest by reverse phase chromatography. An incomplete digest of dGAGP fractionated on the Hamilton PRP-1 reverse phase column yielded two major peptide fractions, designated P1 and P3.

FIG. 7 is the elution profile for a chymotryptic digest of dGAGP fractionated on a Polysulfoethyl aspartamide cation exchange column.

FIG. 8 is the elution profile of dGAGP chymotryptic peptides by reverse phase column chromatography of a) S1, and b) S2.

FIG. 9 shows a proposed model for an exemplary glycopeptide containing an exemplary consensus sequence.

FIG. 10 is the elution profile of the GAGP base hydrolysate by Sephadex G-50 gel permeation chromatography.

FIG. 11 shows the oligonucleotide sequence (SEQ ID NOs:112, 113, 115, 116, 118-121, 123 and 124) sets used to build the synthetic genes which encode the Ser-Pro internal repeat polypeptide (SEQ ID NO:114), the GAGP internal repeat polypeptide (SEQ ID NO:117), the 5′-linker (SEQ ID NO:122) and 3′-linker (SEQ ID NO:125).

FIG. 12 shows Superose-12 gel permeation chromatography with fluorescence detection of (A) culture medium containing (Ser-Hyp)₃₂-EGFP, (B) (GAGP)₃-EGFP medium concentrated four-fold, (C) medium of EGFP targeted to the extracellular matrix (concentrated ten-fold), and (D) 10 mg standard EGFP from Clontech.

FIG. 13 shows PRP-1 reverse-phase fractionation of the Superose-12 peaks containing (A) (Ser-Hyp)₃₂-EGFP, (B) (GAGP)₃-EGFP, and (C) (glyco)proteins in the medium of non-transformed tobacco cells.

FIG. 14 shows polypeptide sequences of (Ser-Hyp)₃₂-EGFP and (GAGP)₃-EGFP before and after deglycosylation. (A) N-terminal amino acid sequence of the glycoprotein, (Ser-Hyp)₃₂-EGFP, with partial sequence of both the glycoprotein (upper sequence) (SEQ ID NO:126) and its polypeptide after deglycosylation (lower sequence) (SEQ ID NO:127). X denotes blank cycles which correspond to glycosylated Hyp; glycoamino acids tend to produce blank cycles during Edman degradation, an exception being arabinosyl Hyp. (B) Polypeptide sequence of glycosylated (GAGP)₃-EGFP (upper sequence) (SEQ ID NO:128) and deglycosylated (GAGP)₃-EGFP (lower sequence) (SEQ ID NO:129). Residues marked with an asterisk (*) denote low molar yields of Hyp and likely sites of arabinogalactan polysaccharide attachment in glycosylated (GAGP)₃-EGFP. For example, yields were 480 pM Asp in the first cycle, 331 pM Ser in the second, 194 pM Hyp in the third, and 508 pM Ser in the fourth cycle.

FIG. 15 is a diagram of the cloning strategy for generating repeats of GAGP sequences.

FIG. 16 depicts the exemplary (A) nucleotide sequence (SEQ ID NO:130) and amino acid sequence (SEQ ID NO:131) of two GAGP repeats, and (B) nucleotide sequence (SEQ ID NO:132) and amino acid sequence (SEQ ID NO:133) of four GAGP repeats.

DEFINITIONS

The term “gene” refers to a DNA sequence that comprises control and coding sequences necessary for the production of a polypeptide or its precursor. The polypeptide can be encoded by a full length coding sequence or by any portion of the coding sequence.

The term “nucleic acid sequence of interest” refers to any nucleic acid sequence the manipulation of which may be deemed desirable for any reason by one of ordinary skill in the art (e.g., confer improved qualities).

The term “wild-type” when made in reference to a gene refers to a gene which has the characteristics of a gene isolated from a naturally occurring source. The term “wild-type” when made in reference to a gene product refers to a gene product which has the characteristics of a gene product isolated from a naturally occurring source. A wild-type gene is that which is most frequently observed in a population and is thus arbitrarily designated the “normal” or “wild-type” form of the gene. In contrast, the term “modified” or “mutant” when made in reference to a gene or to a gene product refers, respectively, to a gene or to a gene product which displays modifications in sequence and/or functional properties (i.e., altered characteristics) when compared to the wild-type gene or gene product. It is noted that naturally-occurring mutants can be isolated; these are identified by the fact that they have altered characteristics when compared to the wild-type gene or gene product.

The term “recombinant” when made in reference to a DNA molecule refers to a DNA molecule which is comprised of segments of DNA joined together by means of molecular biological techniques. The term “recombinant” when made in reference to a protein or a polypeptide refers to a protein molecule which is expressed using a recombinant DNA molecule.

As used herein, the terms “vector” and “vehicle” are used interchangeably in reference to nucleic acid molecules that transfer DNA segment(s) from one cell to another.

The term “expression vector” or “expression cassette” as used herein refers to a recombinant DNA molecule containing a desired coding sequence and appropriate nucleic acid sequences necessary for the expression of the operably linked coding sequence in a particular host organism. Nucleic acid sequences necessary for expression in prokaryotes usually include a promoter, an operator (optional), and a ribosome binding site, often along with other sequences. Eukaryotic cells are known to utilize promoters, enhancers, and termination and polyadenylation signals.

The terms “targeting vector” or “targeting construct” refer to oligonucleotide sequences comprising a gene of interest flanked on either side by a recognition sequence which is capable of homologous recombination of the DNA sequence located between the flanking recognition sequences.

The terms “in operable combination”, “in operable order” and “operably linked” as used herein refer to the linkage of nucleic acid sequences in such a manner that a nucleic acid molecule capable of directing the transcription of a given gene and/or the synthesis of a desired protein molecule is produced. The term also refers to the linkage of amino acid sequences in such a manner so that a functional protein is produced.

The term “transformation” as used herein refers to the introduction of foreign DNA into cells. Transformation of a plant cell may be accomplished by a variety of means known in the art including particle mediated gene transfer (see, e.g., U.S. Pat. No. 5,584,807 hereby incorporated by reference); infection with an Agrobacterium strain containing the foreign DNA for random integration (U.S. Pat. No. 4,940,838 hereby incorporated by reference) or targeted integration (U.S. Pat. No. 5,501,967 hereby incorporated by reference) of the foreign DNA into the plant cell genome; electroinjection (Nan et al. (1995) In “Biotechnology in Agriculture and Forestry,” Ed. Y. P. S. Bajaj, Springer-Verlag Berlin Heidelberg, Vol 34:145-155; Griesbach (1992) HortScience 27:620); fusion with liposomes, lysosomes, cells, minicells or other fusible lipid-surfaced bodies (Fraley et al. (1982) Proc. Natl. Acad. Sci. USA 79:1859-1863); polyethylene glycol (Krens et al. (1982) Nature 296:72-74); chemicals that increase free DNA uptake; transformation using virus, and the like.

The terms “infecting” and “infection” with a bacterium refer to co-incubation of a target biological sample (e.g., cell, tissue, etc.) with the bacterium under conditions such that nucleic acid sequences contained within the bacterium are introduced into one or more cells of the target biological sample.

The term “Agrobacterium” refers to a soil-borne, Gram-negative, rod-shaped phytopathogenic bacterium which causes crown gall. The term “Agrobacterium” includes, but is not limited to, the strains Agrobacterium tumefaciens (which typically causes crown gall in infected plants), and Agrobacterium rhizogenes (which causes hairy root disease in infected host plants). Infection of a plant cell with Agrobacterium generally results in the production of opines (e.g., nopaline, agropine, octopine etc.) by the infected cell. Thus, Agrobacterium strains which cause production of nopaline (e.g., strain LBA4301, C58, A208) are referred to as “nopaline-type” Agrobacteria; Agrobacterium strains which cause production of octopine (e.g., strain LBA4404, Ach5, B6) are referred to as “octopine-type” Agrobacteria; and Agrobacterium strains which cause production of agropine (e.g., strain EHA105, EHA101, A281) are referred to as “agropine-type” Agrobacteria.

The terms “bombarding,” “bombardment,” and “biolistic bombardment” refer to the process of accelerating particles towards a target biological sample (e.g., cell, tissue, etc.) to effect wounding of the cell membrane of a cell in the target biological sample and/or entry of the particles into the target biological sample. Methods for biolistic bombardment are known in the art (e.g., U.S. Pat. No. 5,584,807, the contents of which are herein incorporated by reference), and are commercially available (e.g., the helium gas-driven microprojectile accelerator (PDS-1000/He) (BioRad)).

The term “microwounding” when made in reference to plant tissue refers to the introduction of microscopic wounds in that tissue. Microwounding may be achieved by, for example, particle or biolistic bombardment.

The term “transgenic” when used in reference to a plant cell refers to a plant cell which comprises a transgene, or whose genome has been altered by the introduction of a transgene. The term “transgenic” when used in reference to a plant refers to a plant which comprises one or more cells which contain a transgene, or whose genome has been altered by the introduction of a transgene. These transgenic cells and transgenic plants may be produced by several methods including the introduction of a “transgene” comprising nucleic acid (usually DNA) into a target cell or integration into a chromosome of a target cell by way of human intervention, such as by the methods described herein.

The term “transgene” as used herein refers to any nucleic acid sequence which is introduced into the genome of a plant cell by experimental manipulations. A transgene may be an “endogenous DNA sequence,” or a “heterologous DNA sequence” (i.e., “foreign DNA”). The term “endogenous DNA sequence” refers to a nucleotide sequence which is naturally found in the cell into which it is introduced so long as it does not contain some modification (e.g., a point mutation, the presence of a selectable marker gene, etc.) relative to the naturally-occurring sequence. The term “heterologous DNA sequence” refers to a nucleotide sequence which is ligated to, or is manipulated to become ligated to, a nucleic acid sequence to which it is not ligated in nature, or to which it is ligated at a different location in nature. Heterologous DNA is not endogenous to the cell into which it is introduced, but has been obtained from another cell. Heterologous DNA also includes an endogenous DNA sequence which contains some modification. Generally, although not necessarily, heterologous DNA encodes RNA and proteins that are not normally produced by the cell in which it is expressed. Examples of heterologous DNA include mutated wild-type genes (i.e., wild-type genes that have been modified such that they are no longer wild-type genes), reporter genes, transcriptional and translational regulatory sequences, selectable marker proteins (e.g., proteins which confer drug resistance), etc.

As used herein, the term “probe” when made in reference to an oligonucleotide (i.e., a sequence of nucleotides) refers to an oligonucleotide, whether occurring naturally as in a purified restriction digest or produced synthetically, recombinantly or by PCR amplification, which is capable of hybridizing to another oligonucleotide of interest. A probe may be single-stranded or double-stranded. Probes are useful in the detection, identification and isolation of particular gene sequences. Oligonucleotide probes may be labelled with a “reporter molecule,” so that the probe is detectable using a detection system. Detection systems include, but are not limited to, enzyme, fluorescent, radioactive, and luminescent systems.

The term “selectable marker” as used herein refers to a gene which encodes an enzyme having an activity that confers resistance to an antibiotic or drug upon the cell in which the selectable marker is expressed. Selectable markers may be “positive” or “negative.” Examples of positive selectable markers include the neomycin phosphotransferase (NPTII) gene which confers resistance to G418 and to kanamycin, and the bacterial hygromycin phosphotransferase gene (hyg), which confers resistance to the antibiotic hygromycin. Negative selectable markers encode an enzymatic activity whose expression is cytotoxic to the cell when grown in an appropriate selective medium. For example, the HSV-tk gene is commonly used as a negative selectable marker. Expression of the HSV-tk gene in cells grown in the presence of ganciclovir or acyclovir is cytotoxic; thus, growth of cells in selective medium containing ganciclovir or acyclovir selects against cells capable of expressing a functional HSV TK enzyme.

The terms “promoter element,” “promoter,” or “promoter sequence” as used herein, refer to a DNA sequence that is located at the 5′ end of (i.e., precedes) the protein coding region of a DNA polymer. The location of most promoters known in nature precedes the transcribed region. The promoter functions as a switch, activating the expression of a gene. If the gene is activated, it is said to be transcribed, or participating in transcription. Transcription involves the synthesis of mRNA from the gene. The promoter, therefore, serves as a transcriptional regulatory element and also provides a site for initiation of transcription of the gene into mRNA.

The term “amplification” is defined as the production of additional copies of a nucleic acid sequence and is generally carried out using polymerase chain reaction technologies well known in the art [Dieffenbach C W and G S Dveksler (1995) PCR Primer, a Laboratory Manual, Cold Spring Harbor Press, Plainview N.Y.]. As used herein, the term “polymerase chain reaction” (“PCR”) refers to the method disclosed in U.S. Pat. Nos. 4,683,195, 4,683,202 and 4,965,188, all of which are hereby incorporated by reference, which describe a method for increasing the concentration of a segment of a target sequence in a mixture of genomic DNA without cloning or purification. This process for amplifying the target sequence consists of introducing a large excess of two oligonucleotide primers to the DNA mixture containing the desired target sequence, followed by a precise sequence of thermal cycling in the presence of a DNA polymerase. The two primers are complementary to their respective strands of the double stranded target sequence. To effect amplification, the mixture is denatured and the primers then annealed to their complementary sequences within the target molecule. Following annealing, the primers are extended with a polymerase so as to form a new pair of complementary strands. The steps of denaturation, primer annealing and polymerase extension can be repeated many times (i.e., denaturation, annealing and extension constitute one “cycle”; there can be numerous “cycles”) to obtain a high concentration of an amplified segment of the desired target sequence. The length of the amplified segment of the desired target sequence is determined by the relative positions of the primers with respect to each other, and therefore, this length is a controllable parameter. By virtue of the repeating aspect of the process, the method is referred to as the “polymerase chain reaction” (hereinafter “PCR”). Because the desired amplified segments of the target sequence become the predominant sequences (in terms of concentration) in the mixture, they are said to be “PCR amplified.”
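As a purely illustrative aid, the effect of repeating the denaturation, annealing and extension cycle can be approximated with a simple growth model. The sketch below assumes an idealized reaction in which each cycle multiplies the number of target copies by (1 + efficiency), with perfect doubling at an efficiency of 1.0. This model and the function name `copies_after_cycles` are assumptions introduced for this example and are not taken from the cited patents.

```python
def copies_after_cycles(initial_copies, cycles, efficiency=1.0):
    """Idealized number of target copies after a number of PCR cycles.

    efficiency is the assumed fraction of templates copied per cycle
    (1.0 means every template is duplicated in every cycle).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return initial_copies * (1.0 + efficiency) ** cycles
```

Under these assumptions a single template copy carried through 30 ideal cycles gives 2^30 copies, which is why the amplified segment comes to predominate in the mixture.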

With modern methods of PCR, it is possible to amplify a single copy of a specific target sequence in genomic DNA to a level detectable by several different methodologies (e.g., hybridization with a labeled probe; incorporation of biotinylated primers followed by avidin-enzyme conjugate detection; and/or incorporation of ³²P-labeled deoxyribonucleotide triphosphates, such as dCTP or dATP, into the amplified segment). In addition to genomic DNA, any oligonucleotide sequence can be amplified with the appropriate set of primer molecules. In particular, the amplified segments created by the PCR process itself are, themselves, efficient templates for subsequent PCR amplifications. Amplified target sequences may be used to obtain segments of DNA (e.g., genes) for the construction of targeting vectors, transgenes, etc.

The present invention contemplates using amplification techniques such as PCR to obtain the cDNA (or portions thereof) of plant genes encoding plant gums and other hydroxyproline-rich polypeptides. In one embodiment, primers are designed using the synthetic gene sequences (e.g., containing sequences encoding particular motifs) described herein and PCR is carried out (using genomic DNA or other source of nucleic acid from any plant capable of producing a gum exudate) under conditions of low stringency. In another embodiment, PCR is carried out under high stringency. The amplified products can be run out on a gel and isolated from the gel.

The term “hybridization” as used herein refers to any process by which a strand of nucleic acid joins with a complementary strand through base pairing [Coombs J (1994) Dictionary of Biotechnology, Stockton Press, New York N.Y.].

As used herein, the terms “complementary” or “complementarity” when used in reference to polynucleotides refer to polynucleotides which are related by the base-pairing rules. For example, the sequence 5′-AGT-3′ is complementary to the sequence 5′-ACT-3′. Complementarity may be “partial,” in which only some of the nucleic acids' bases are matched according to the base pairing rules. Or, there may be “complete” or “total” complementarity between the nucleic acids. The degree of complementarity between nucleic acid strands has significant effects on the efficiency and strength of hybridization between nucleic acid strands. This is of particular importance in amplification reactions, as well as detection methods which depend upon binding between nucleic acids.
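The base-pairing rules referred to above can be sketched in a few lines, offered only as an illustration. The sketch assumes DNA sequences written 5′ to 3′ with the letters A, C, G and T, and strands of equal length for the partial-complementarity measure. The function names are hypothetical.

```python
# Watson-Crick pairing for DNA bases.
_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return "".join(_PAIR[base] for base in reversed(seq.upper()))


def complementary_fraction(strand_a, strand_b):
    """Fraction of positions at which two equal-length strands,
    both written 5' to 3', can base-pair in antiparallel orientation."""
    if len(strand_a) != len(strand_b):
        raise ValueError("strands must have equal length")
    partner = reverse_complement(strand_b)
    matches = sum(a == p for a, p in zip(strand_a.upper(), partner))
    return matches / len(strand_a)
```

With these definitions, `reverse_complement("AGT")` gives `"ACT"`, matching the example in the text, and a value of `complementary_fraction` below 1.0 corresponds to partial complementarity.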

The term “homology” when used in relation to nucleic acids refers to a degree of complementarity. There may be partial homology or complete homology (i.e., identity). A partially complementary sequence that at least partially inhibits a completely complementary sequence from hybridizing to a target nucleic acid is referred to using the functional term “substantially homologous.” The inhibition of hybridization of the completely complementary sequence to the target sequence may be examined using a hybridization assay (Southern or Northern blot, solution hybridization and the like) under conditions of low stringency. A substantially homologous sequence or probe will compete for and inhibit the binding (i.e., the hybridization) of a sequence which is completely homologous to a target under conditions of low stringency. This is not to say that conditions of low stringency are such that non-specific binding is permitted; low stringency conditions require that the binding of two sequences to one another be a specific (i.e., selective) interaction. The absence of non-specific binding may be tested by the use of a second target which lacks even a partial degree of complementarity (e.g., less than about 30% identity); in the absence of non-specific binding the probe will not hybridize to the second non-complementary target.

Low stringency conditions when used in reference to nucleic acid hybridization comprise conditions equivalent to binding or hybridization at 68° C. in a solution consisting of 5×SSPE (43.8 g/l NaCl, 6.9 g/l NaH₂PO₄·H₂O and 1.85 g/l EDTA, pH adjusted to 7.4 with NaOH), 1% SDS, 5×Denhardt's reagent [50×Denhardt's contains the following per 500 ml: 5 g Ficoll (Type 400, Pharmacia), 5 g BSA (Fraction V; Sigma)] and 100 μg/ml denatured salmon sperm DNA followed by washing in a solution comprising 0.2×SSPE, and 0.1% SDS at room temperature when a DNA probe of about 100 to about 1000 nucleotides in length is employed.

High stringency conditions when used in reference to nucleic acid hybridization comprise conditions equivalent to binding or hybridization at 68° C. in a solution consisting of 5×SSPE, 1% SDS, 5×Denhardt's reagent and 100 μg/ml denatured salmon sperm DNA followed by washing in a solution comprising 0.1×SSPE, and 0.1% SDS at 68° C. when a probe of about 100 to about 1000 nucleotides in length is employed.
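For convenience, the two sets of conditions can be tabulated and the salt masses scaled to a working volume. The following is a minimal sketch that uses only the concentrations recited above. The dictionary layout and the helper name `sspe_masses` are assumptions for illustration, and linear scaling of the 5×SSPE recipe to other strengths is assumed.

```python
# Grams per liter of each component in 5x SSPE, as recited above.
SSPE_5X_G_PER_L = {"NaCl": 43.8, "NaH2PO4.H2O": 6.9, "EDTA": 1.85}

# Both conditions share the same hybridization step (68 C, 5x SSPE,
# 1% SDS, 5x Denhardt's, 100 ug/ml salmon sperm DNA); only the wash differs.
WASH = {
    "low": {"sspe_x": 0.2, "sds_percent": 0.1, "temperature": "room temperature"},
    "high": {"sspe_x": 0.1, "sds_percent": 0.1, "temperature": "68 C"},
}


def sspe_masses(volume_l, strength_x=5.0):
    """Grams of each SSPE component for volume_l liters at strength_x,
    assuming the recipe scales linearly with strength."""
    scale = strength_x / 5.0
    return {name: g_per_l * scale * volume_l
            for name, g_per_l in SSPE_5X_G_PER_L.items()}
```

For example, `sspe_masses(0.5)` gives about 21.9 g NaCl, 3.45 g NaH₂PO₄·H₂O and 0.925 g EDTA for 500 ml of 5×SSPE. The high stringency wash differs from the low stringency wash in its lower salt strength and higher temperature.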

The term “equivalent” when made in reference to a hybridization condition as it relates to a hybridization condition of interest means that the hybridization condition and the hybridization condition of interest result in hybridization of nucleic acid sequences which have the same range of percent (%) homology. For example, if a hybridization condition of interest results in hybridization of a first nucleic acid sequence with other nucleic acid sequences that have from 50% to 70% homology to the first nucleic acid sequence, then another hybridization condition is said to be equivalent to the hybridization condition of interest if this other hybridization condition also results in hybridization of the first nucleic acid sequence with the other nucleic acid sequences that have from 50% to 70% homology to the first nucleic acid sequence.

When used in reference to nucleic acid hybridization the art knows well that numerous equivalent conditions may be employed to comprise either low or high stringency conditions; factors such as the length and nature (DNA, RNA, base composition) of the probe and nature of the target (DNA, RNA, base composition, present in solution or immobilized, etc.) and the concentration of the salts and other components (e.g., the presence or absence of formamide, dextran sulfate, polyethylene glycol) are considered and the hybridization solution may be varied to generate conditions of either low or high stringency hybridization different from, but equivalent to, the above listed conditions.

“Stringency” when used in reference to nucleic acid hybridization typically occurs in a range from about T_(m)−5° C. (5° C. below the T_(m) of the probe) to about 20° C. to 25° C. below T_(m). As will be understood by those of skill in the art, a stringent hybridization can be used to identify or detect identical polynucleotide sequences or to identify or detect similar or related polynucleotide sequences. Under “stringent conditions” a nucleic acid sequence of interest will hybridize to its exact complement and closely related sequences.

As used herein, the term “fusion protein” refers to a chimeric protein containing the protein of interest (i.e., GAGP and fragments thereof) joined to an exogenous protein fragment (the fusion partner which consists of a non-GAGP sequence). The fusion partner may provide a detectable moiety, may provide an affinity tag to allow purification of the recombinant fusion protein from the host cell, or both. If desired, the fusion partner may be removed from the protein of interest (i.e., GAGP protein or fragments thereof) by a variety of enzymatic or chemical means known to the art. In an alternative embodiment, the fusion proteins of the invention may be used as substrates for plant glycosyltransferases. For example, after deglycosylation, the exemplary (Ser-Hyp)₃₂-EGFP (see Example 23) may be used as an acceptor for galactose addition, with UDP-galactose as co-substrate, catalyzed by galactosyl transferase. The fusion partner EGFP allows facile isolation of the newly galactosylated polypeptide. Fusion proteins containing sequences of the invention may be isolated using methods known in the art, such as gel filtration (Example 22), hydrophobic interaction chromatography (HIC), reverse phase chromatography, and anion exchange chromatography.

As used herein the term “non-gum arabic glycoprotein” or “non-gum arabic glycoprotein sequence” refers to that portion of a fusion protein which comprises a protein or protein sequence which is not derived from a gum arabic glycoprotein.

The term “protein of interest” as used herein refers to the protein whose expression is desired within the fusion protein. In a fusion protein the protein of interest (e.g., GAGP) will be joined or fused with another protein or protein domain (e.g., GFP), the fusion partner, to allow for enhanced stability of the protein of interest and/or ease of purification of the fusion protein.

As used herein, the term “purified” or “to purify” refers to the removal of contaminants from a sample. For example, recombinant HRGP polypeptides, including HRGP-GFP fusion proteins, are purified by the removal of host cell components such as nucleic acids and lipopolysaccharide (e.g., endotoxin). “Substantially purified” molecules are at least 60% free, preferably at least 75% free, and more preferably at least 90% free from other components with which they are naturally associated.

The term “recombinant DNA molecule” as used herein refers to a DNA molecule which is comprised of segments of DNA joined together by means of molecular biological techniques.

The term “recombinant protein” or “recombinant polypeptide” as used herein refers to a protein molecule which is expressed from a recombinant DNA molecule.

As used herein the term “portion” when in reference to a protein (as in “a portion of a given protein”) refers to fragments of that protein. The fragments may range in size from four (4) amino acid residues to the entire amino acid sequence minus one amino acid. Thus, a portion of an amino acid sequence which is 30 amino acids long refers to any fragment of that sequence which ranges in size from 4 to 29 contiguous amino acids of that sequence. A polypeptide comprising “at least a portion of” an amino acid sequence comprises from four (4) contiguous amino acid residues of the amino acid sequence to the entire amino acid sequence. When made in reference to a nucleic acid sequence, the term “portion” means a fragment which ranges in size from twelve (12) nucleotides to the entire nucleic acid sequence minus one nucleotide. Thus, a nucleic acid sequence comprising “at least a portion of” a nucleotide sequence comprises from twelve (12) contiguous nucleotide residues of the nucleotide sequence to the entire nucleotide sequence.
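The size limits in this definition can be restated compactly. The sketch below is illustrative only and encodes nothing beyond the bounds recited above (minimum of 4 residues for a protein and 12 nucleotides for a nucleic acid, maximum of the full length minus one for a “portion”, and the full length for “at least a portion of”). The function names are hypothetical.

```python
PROTEIN_MIN = 4        # amino acid residues
NUCLEIC_ACID_MIN = 12  # nucleotides


def portion_lengths(full_length, minimum):
    """Lengths that qualify as a "portion": minimum up to full_length - 1."""
    return range(minimum, full_length)


def at_least_a_portion_lengths(full_length, minimum):
    """Lengths covered by "at least a portion of": minimum up to full_length."""
    return range(minimum, full_length + 1)


def is_portion(fragment, full_sequence, minimum):
    """True if fragment is a contiguous stretch of full_sequence of a
    qualifying length (shorter than the full sequence)."""
    return (len(fragment) in portion_lengths(len(full_sequence), minimum)
            and fragment in full_sequence)
```

For a 30-residue protein, `portion_lengths(30, PROTEIN_MIN)` spans 4 to 29, which agrees with the example given in the definition.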

The term “isolated” when used in relation to a nucleic acid, as in “an isolated nucleic acid sequence” refers to a nucleic acid sequence that is identified and separated from at least one contaminant nucleic acid with which it is ordinarily associated in its natural source.

The terms “motif” and “module” are equivalent terms when made in reference to an amino acid sequence, and refer to the particular type, number, and arrangement of amino acids in that sequence.

The term “glycomodule” refers to a glycopeptide in which the carbohydrate portion is covalently linked to an amino acid sequence motif.

The term “repeating sequence” when made in reference to a peptide sequence that is contained in a polypeptide sequence means that the peptide sequence is reiterated from 1 to 10 times, more preferably from 1 to 100 times, and most preferably from 1 to 1000 times, in the polypeptide sequence. The repeats of the peptide sequence may be non-contiguous or contiguous. The term “non-contiguous repeat” when made in reference to a repeating peptide sequence means that at least one amino acid (or amino acid analog) is placed between the repeating sequences. The term “contiguous repeat” when made in reference to a repeating peptide sequence means that there are no intervening amino acids (or amino acid analogs) between the repeating sequences.
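The distinction between contiguous and non-contiguous repeats can be made concrete with a short sketch, given here only as an illustration. It assumes one-letter residue codes with O standing for Hyp, and counts non-overlapping copies of the repeat unit from left to right. The function names and the example polypeptide are hypothetical.

```python
def find_repeats(polypeptide, unit):
    """Start positions of non-overlapping copies of unit, left to right."""
    starts = []
    i = polypeptide.find(unit)
    while i != -1:
        starts.append(i)
        i = polypeptide.find(unit, i + len(unit))
    return starts


def classify_junctions(polypeptide, unit):
    """Count copies of unit and how neighboring copies are joined.

    A junction is contiguous when no residue lies between two copies,
    and non-contiguous when at least one residue intervenes.
    """
    starts = find_repeats(polypeptide, unit)
    gaps = [b - (a + len(unit)) for a, b in zip(starts, starts[1:])]
    contiguous = sum(1 for gap in gaps if gap == 0)
    return {
        "copies": len(starts),
        "contiguous": contiguous,
        "non_contiguous": len(gaps) - contiguous,
    }
```

Applied to the hypothetical polypeptide `POVYKPOVYKGPOVYK` with the unit `POVYK`, the sketch reports three copies joined by one contiguous and one non-contiguous junction, the latter because a single Gly separates the second and third copies.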

GENERAL DESCRIPTION OF THE INVENTION

The present invention relates generally to the field of plant gums and other hydroxyproline-rich glycoproteins, and in particular, to the expression of synthetic genes designed from repetitive peptide sequences. The hydroxyproline-rich glycoprotein (HRGP) superfamily is ubiquitous in the primary cell wall or extracellular matrix throughout the plant kingdom. Family members are diverse in structure and implicated in all aspects of plant growth and development. This includes plant responses to stress imposed by pathogenesis and mechanical wounding.

Plant HRGPs have no known animal homologs. Furthermore, hydroxyproline residues are O-glycosylated in plant glycoproteins but never in animals. At the molecular level the function of these unique plant glycoproteins remains largely unexplored.

HRGPs are, to a lesser or greater extent, extended, repetitive, modular proteins. The modules are small (generally 4-6 amino acid residue motifs), usually glycosylated, with most HRGPs being made up of more than one type of repetitive module. For purposes of constructing the synthetic genes of the present invention, it is useful to view the glycosylated polypeptide modules not merely as peptides or oligosaccharides but as small functional motifs.

The description of the invention involves A) the design of the polypeptide of interest, B) the production of synthetic genes encoding the polypeptide of interest, C) the construction of the expression vectors, D) selection of the host cells, E) introduction of the expression construct into a particular cell (whether in vitro or in vivo), F) preferred consensus sequences and portions thereof, and G) O-glycosylation codes.

A. Design of the Polypeptide of Interest

The present invention contemplates polypeptides that are fragments of hydroxyproline-rich glycoproteins (HRGPs), repetitive proline-rich proteins (RPRPs) and arabinogalactan-proteins (AGPs). The present invention contemplates portions of HRGPs comprising one or more of the highly conserved Ser-Hyp₄ (SEQ ID NO:3) motif(s). The present invention also contemplates portions of RPRPs comprising one or more of the pentapeptide motif: Pro-Hyp-Val-Tyr-Lys (SEQ ID NO:4). The present invention also contemplates portions of AGPs comprising one or more Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9) repeats.

While an understanding of the natural mechanism of glycosylation is not required for the successful operation of the present invention, it is believed that in GAGP and other HRGPs, repetitive Xaa-Hyp motifs constitute a Hyp-glycosylation code where Hyp occurring in contiguous motifs (Xaa-Hyp-Hyp) and Hyp occurring in non-contiguous Hyp repeats are recognized by different enzymes: arabinosyltransferases and galactosyltransferases, respectively.
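The proposed Hyp-glycosylation code lends itself to a simple rule-based sketch, offered only as an illustration of the hypothesis and not as a predictor of actual glycosylation. The sketch assumes one-letter residue codes with O standing for Hyp, and treats any Hyp adjacent to another Hyp as contiguous and every other Hyp as non-contiguous. The function name and the output labels are hypothetical.

```python
def classify_hyp(sequence):
    """Map each Hyp position (0-based) to the enzyme class that the
    contiguity hypothesis would assign to it.

    Contiguous Hyp (next to another Hyp)  -> "arabinosyltransferase"
    Non-contiguous Hyp (no Hyp neighbor)  -> "galactosyltransferase"
    """
    result = {}
    for i, residue in enumerate(sequence):
        if residue != "O":
            continue
        left = i > 0 and sequence[i - 1] == "O"
        right = i + 1 < len(sequence) and sequence[i + 1] == "O"
        if left or right:
            result[i] = "arabinosyltransferase"
        else:
            result[i] = "galactosyltransferase"
    return result
```

For the twelve-residue sequence Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro, written `SOSOTOTOOOGP`, the rule assigns the three isolated Hyp residues to galactosyltransferases and the three-residue Hyp block to arabinosyltransferases.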

The RPRPs (and some nodulins) consist of short repetitive motifs (e.g., soybean RPRP1: [POVYK]_(n) where O=Hyp) containing the least amount of contiguous Hyp. They also exemplify the low end of the glycosylation range with relatively few Hyp residues arabinosylated and no arabinogalactan polysaccharide. For example, in soybean RPRP1, L-arabinofuranose is attached to perhaps only a single Hyp residue in the molecule.

The extensins occupy an intermediate position in the glycosylation continuum, containing about 50% carbohydrate which occurs mainly as Hyp-arabinosides (1-4 Ara residues), but not as Hyp-arabinogalactan polysaccharide. Extensins contain the repetitive, highly arabinosylated, diagnostic Ser-Hyp₄ (SEQ ID NO:3) glycopeptide module. The precise function of this module is unknown, but earlier work indicates that these motifs of arabinosylated Hyp help stabilize the extended polyproline-II helix of the extensins. Monogalactose also occurs on the Ser residues.

The classical Ser-Hyp₄ (SEQ ID NO:3) glycopeptide module is of special interest. A tetra-L-arabinofuranosyl oligosaccharide is attached to each Hyp residue in the motif. Three uniquely β-linked arabinofuranosyl residues and an α-linked nonreducing terminus comprise the tetraarabinooligosaccharide. While an understanding of the natural mechanism of glycosylation is not required for the successful operation of the present invention, it is believed that the arabinosylated Hyp residues together with the single galactosyl-serine residue undoubtedly form a unique molecular surface topography which interacts with and is recognized by other wall components, possibly including itself. Shorter motifs of Hyp, namely Hyp₃ and Hyp₂, lack the fourth (α-linked) arabinose residue, again suggesting that the fourth Ara, unique to the Hyp₄ motif, has a special role and is presented for recognition or cleavage.

Tetra-arabinose and tri-arabinose are attached to known tetra-Hyp motifs. Those Ser-Hyp₄ isolated from native extensins have every Hyp residue arabinosylated. However, the Ser-Hyp₄ repeats fused to EGFP as disclosed herein showed that some Hyp residues were nonglycosylated, while some were mono- and di-arabinosylated. Mainly, the Hyp residues were tri-arabinosylated and tetra-arabinosylated. For example, Hyp-Ara₄ was 31% of total Hyp, Hyp-Ara₃ was 52% of total Hyp, Hyp-Ara₂ was 8% of total Hyp, and Hyp-Ara was 2% of total Hyp; 7% of the total Hyp was not glycosylated. Most of the serine residues in the invention's exemplary Ser-Hyp₄ repeats fused to EGFP were not galactosylated. This is in contrast to naturally occurring Ser-Hyp₄ in which Ser is often mono-galactosylated. Importantly, Hyp-polysaccharides were never detected by the inventors in the Ser-Hyp₄ repeats fused to EGFP.
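As a quick consistency check on the Hyp-glycoside profile reported above, the following Python sketch tallies the percentages. The dictionary labels and variable names are illustrative choices made here and are not terms from the source.

```python
# Hyp-glycoside profile of the Ser-Hyp4 repeats fused to EGFP, as percent of
# total Hyp (values taken from the paragraph above; labels are illustrative).
profile = {
    "Hyp-Ara4": 31,
    "Hyp-Ara3": 52,
    "Hyp-Ara2": 8,
    "Hyp-Ara1": 2,
    "non-glycosylated Hyp": 7,
}

# The five classes should account for all Hyp residues.
total = sum(profile.values())

# Share of Hyp carrying at least one arabinose residue.
arabinosylated = total - profile["non-glycosylated Hyp"]

# Share of Hyp carrying the two dominant (tri- and tetra-) arabinosides.
tri_and_tetra = profile["Hyp-Ara3"] + profile["Hyp-Ara4"]

print(total, arabinosylated, tri_and_tetra)
```

The tally shows that the reported classes sum to the whole Hyp pool and that the tri- and tetra-arabinosides together make up the large majority of it.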

At the high end of the glycosylation range (~90% sugar), the arabinogalactan-proteins (AGPs) and the related gum arabic glycoprotein (GAGP) are uniquely glycosylated with arabinogalactan polysaccharides. GAGP and all AGPs so far characterized by Hyp-glycoside profiles contain Hyp-linked arabinosides assigned to contiguous Hyp residues by the Hyp contiguity hypothesis. However, these glycoproteins also uniquely contain Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9) repeats. These repeats are putative polysaccharide attachment sites.

The present invention contemplates in particular fragments of gum arabic glycoprotein (GAGP). As noted above, GAGP has been largely refractory to chemical analysis. Prior to the inventors' discovery of the sequences disclosed herein, the largest peptide obtained and sequenced from gum arabic was a peptide of twelve (12) amino acids having the sequence Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro (SEQ ID NO:13). C. L. Delonnay, “Determination of the Protein Constituent Of Gum Arabic” Master of Science Thesis (1993). The present invention contemplates using this Delonnay sequence as well as (heretofore undescribed) larger peptide fragments of GAGP (and variants thereof) for the design of synthetic genes. In this manner, “designer plant gums” can be produced (“designer extensins” are also contemplated).

In one embodiment, the present invention contemplates a substantially purified polypeptide comprising at least a portion of the amino acid consensus sequence Ser-Hyp-Hyp-Hyp-[Hyp/Thr]-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:1 and SEQ ID NO:2) or variants thereof. While an understanding of the natural mechanism of glycosylation is not required for the successful operation of the present invention, it is believed that this GAGP 19-amino acid consensus repeat (which contains both contiguous Hyp and non-contiguous Hyp repeats) is glycosylated in native GAGP with both Hyp-arabinosides and Hyp-polysaccharide in molar ratios. It is further believed that the high molecular weight protein component of gum arabic (i.e. GAGP) is responsible for the remarkable (and advantageous) emulsifying and stabilizing activity exploited by the food and soft drink industries.
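The distinction between contiguous and non-contiguous Hyp in the consensus repeat can be illustrated with a short Python sketch. It uses the Thr variant of the consensus sequence given above. The classification rule (a Hyp counts as contiguous when a neighboring residue is also Hyp) and the function name `classify_hyp` are assumptions made for illustration; the sketch only locates candidate residues and does not predict which residues are actually glycosylated.

```python
def classify_hyp(residues):
    """Return (contiguous, isolated) lists of 1-based Hyp positions.

    Assumed rule: a Hyp is 'contiguous' if the residue directly before or
    after it is also Hyp; otherwise it is 'isolated' (non-contiguous).
    """
    contiguous, isolated = [], []
    for i, res in enumerate(residues):
        if res != "Hyp":
            continue
        prev_is_hyp = i > 0 and residues[i - 1] == "Hyp"
        next_is_hyp = i + 1 < len(residues) and residues[i + 1] == "Hyp"
        if prev_is_hyp or next_is_hyp:
            contiguous.append(i + 1)
        else:
            isolated.append(i + 1)
    return contiguous, isolated


# Thr variant of the 19-residue GAGP consensus repeat quoted above.
consensus = (
    "Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-"
    "Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His"
).split("-")

contiguous, isolated = classify_hyp(consensus)
print("contiguous Hyp positions:", contiguous)
print("non-contiguous Hyp positions:", isolated)
```

Under this rule the repeat contains two blocks of contiguous Hyp flanking a central cluster of non-contiguous Hyp, which is the arrangement the paragraph above describes.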

The sequences of the invention may be used to isolate hydroxyproline-rich glycoprotein-binding molecules. For example, polypeptides encoded by the invention's polynucleotide sequences may be immobilized (covalently or non-covalently) on solid supports or resins for use in isolating HRGP-binding molecules from a variety of sources (e.g. algae, plants, animals, microorganisms). Generic methods for immobilizing polypeptides are known in the art using commercially available kits. For example, the desired polypeptide sequence may be expressed as a fusion protein with heterologous protein A which allows immobilization of the fusion protein on immobilized immunoglobulin. Additionally, pGEX vectors (Promega, Madison Wis.) may be used to express the desired polypeptides as a fusion protein with glutathione S-transferase (GST) which may be adsorbed to glutathione-agarose beads.

The invention's sequences may also be used to make polyclonal and monoclonal antibodies. Generic methods for generating polyclonal and monoclonal antibodies are known in the art. For example, monoclonal antibodies may be generated using the methods of Kohler and Milstein (1976) Eur. J. Immunol. 6:511-519 (Exhibit B) and of J. Goding (1986) In “Monoclonal Antibodies: Principles and Practice,” Academic Press, pp 59-103.

B. Production of Synthetic Genes

The present invention contemplates the use of synthetic genes engineered for the expression of repetitive glycopeptide modules in cells, including but not limited to callus and suspension cultures. It is not intended that the present invention be limited by the precise number of repeats.

In one embodiment, the present invention contemplates that the nucleic acid sequences encoding the consensus sequence for GAGP (i.e. SEQ ID NO:1 and SEQ ID NO:2) or variants thereof may be used as a repeating sequence between two (2) and up to fifty (50) times, more preferably between ten (10) and up to thirty (30) times, and most preferably approximately twenty (20) times. The nucleic acid sequence encoding the consensus sequence (i.e. SEQ ID NO:1 and SEQ ID NO:2) or variants thereof may be used as contiguous repeats or may be used as non-contiguous repeats.
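To give a sense of scale for the repeat numbers in this embodiment, the following Python sketch assembles contiguous repeats of the consensus module at the residue level. The function name `repeat_module` and the range check are illustrative assumptions; the check mirrors the two-to-fifty range of this embodiment only, since the invention is not limited by the precise number of repeats.

```python
# Thr variant of the 19-residue GAGP consensus repeat (from the text).
CONSENSUS = (
    "Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-"
    "Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His"
).split("-")


def repeat_module(module, n_repeats):
    """Return n_repeats contiguous, head-to-tail copies of a peptide module.

    The 2..50 bound reflects the embodiment described above and is an
    illustrative guard, not a limit of the invention.
    """
    if not 2 <= n_repeats <= 50:
        raise ValueError("this sketch covers only the 2 to 50 repeat embodiment")
    return list(module) * n_repeats


polymer = repeat_module(CONSENSUS, 20)  # the 'most preferred' repeat number
print(len(polymer), "residues,", polymer.count("Hyp"), "Hyp residues")
```

At twenty repeats the encoded backbone is several hundred residues long, and nearly half of its residues are proline codons destined for hydroxylation.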

In designing any HRGP gene cassette the following guidelines are employed:

1) Minimization of the repetitive nature of the coding sequence while still taking into account the HRGP codon bias of the host plant (e.g., when tomato is the host plant, the codon usage bias of the tomato which favors CCA and CCT [but not CCG] for Pro residues, and TCA and TCC for Ser residues is employed). Zea mays (corn) and perhaps other graminaceous monocotyledons (e.g. rice, barley, wheat and all grasses) prefer CCG and CCC for proline; GTC and CTT for valine; and AAG for lysine. Dicotyledons (including legumes) prefer CCA and CCT for proline and TCA and TCT for serine.

2) Minimization of strict sequence periodicity.

3) Non-palindromic ends are used for the monomers and end linkers to assure proper “head-to-tail” polymerization.

4) The constructs contain no internal restriction enzyme recognition sites for the restriction enzymes employed for the insertion of these sequences into expression vectors or during subsequent manipulations of such vectors. Typically, the 5′ linker contains an XmaI site downstream of the BamHI site used for cloning into the cloning vector (e.g., pBluescript). The XmaI site is used for insertion of the HRGP gene cassette into the expression vector (pBI121-Sig-EGFP). Typically, the 3′ linker contains an AgeI site upstream of the EcoRI site used for cloning into the cloning vector (e.g., pBluescript). The AgeI site is used for insertion of the HRGP gene cassette into the expression vector. [For plasmid pBI121-Sig, which does not contain GFP for the fusion protein, the same signal sequence (SS) is used, but the 3′ linkers contain an SstI restriction site for insertion as an XmaI/SstI fragment behind the signal sequence and before the NOS terminator].

5) The oligonucleotides used are high quality (e.g., from GibcoBRL, Operon) and have been purified away from unwanted products of the synthesis.

6) The T_(M) of correctly aligned oligomers is greater than the T_(M) of possible dimers, hairpins or crossdimers.

One of skill in the art appreciates that the hydroxyproline (Hyp) residues in the sequences of the invention are produced as the result of post-translational modification of proline (Pro) residues in the polypeptide which is encoded by the gene. Thus, where a hydroxyproline residue is desired to be present in the sequences of the invention, the corresponding codon would be selected to encode proline. Edman degradation may be used to identify which Pro residues have been hydroxylated to Hyp, as described in Example 23, infra.
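Guidelines 1 and 4, together with the Hyp-to-Pro codon rule just described, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the inventors' design procedure. The Pro and Ser codons are the tomato preferences quoted in guideline 1; the codons listed for Thr, Leu, Gly and His are placeholders chosen here from the standard genetic code because the text gives no preference for them. Alternating between the listed codons is one simple way to reduce repetitiveness, and the names `back_translate` and `internal_sites` are hypothetical.

```python
from itertools import cycle

# Codon choices per residue. Pro and Ser follow the tomato bias stated in the
# text; the remaining entries are ASSUMED placeholders (valid standard codons).
PREFERRED_CODONS = {
    "Pro": ["CCA", "CCT"],
    "Ser": ["TCA", "TCC"],
    "Thr": ["ACA", "ACT"],  # assumption
    "Leu": ["TTG", "CTT"],  # assumption
    "Gly": ["GGA", "GGT"],  # assumption
    "His": ["CAC", "CAT"],  # assumption
}

# Recognition sequences of the enzymes named in guideline 4. All five are
# palindromic, so scanning one strand also covers the complementary strand.
CLONING_SITES = {
    "BamHI": "GGATCC",
    "EcoRI": "GAATTC",
    "XmaI": "CCCGGG",
    "AgeI": "ACCGGT",
    "SstI": "GAGCTC",
}


def back_translate(residues):
    """Encode a peptide as DNA, writing each Hyp as a Pro codon and cycling
    through the listed codons so the same codon is not reused every time."""
    pickers = {aa: cycle(codons) for aa, codons in PREFERRED_CODONS.items()}
    dna = []
    for res in residues:
        aa = "Pro" if res == "Hyp" else res  # Hyp arises from Pro after translation
        dna.append(next(pickers[aa]))
    return "".join(dna)


def internal_sites(dna):
    """Return {enzyme: [0-based positions]} for cloning sites found in dna."""
    hits = {}
    for name, site in CLONING_SITES.items():
        positions = [i for i in range(len(dna) - len(site) + 1)
                     if dna.startswith(site, i)]
        if positions:
            hits[name] = positions
    return hits


peptide = ("Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-"
           "Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His").split("-")
dna = back_translate(peptide)
print(dna)
print("internal cloning sites:", internal_sites(dna) or "none")
```

A cassette that reports any internal site would need a different codon choice at the offending position before it could be cloned with the enzymes listed above. The sketch does not address guidelines 2, 3, 5 or 6.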

C. Construction of Expression Vectors

It is not intended that the present invention be limited by the nature of the expression vector. A variety of vectors are contemplated. In one embodiment, two plant transformation vectors are prepared, both derived from pBI121 (Clontech). Both contain an extensin signal sequence (SS) for transport of the constructs through the ER/Golgi for posttranslational modification. A first plasmid construct contains Green Fluorescent Protein (GFP) as a reporter protein instead of GUS. A second plasmid does not contain GFP.

pBI121 is the Jefferson vector in which the BamHI and SstI sites can be used to insert foreign DNA between the 35S CaMV promoter and the termination/polyadenylation signal from the nopaline synthase gene (NOS-ter) of the Agrobacterium Ti plasmid; it also contains an RK2 origin of replication, a kanamycin resistance gene, and the GUS reporter gene.

Signal Sequences. As noted above, the GUS sequence is replaced (via BamHI/SstI) with a synthetic DNA sequence encoding a peptide signal sequence based on the extensin signal sequences of Nicotiana plumbaginifolia and N. tabacum:

MGKMASLFATFLVVLVSLSLAQTTRVVPVASSAP (SEQ ID NO: 14)

The DNA sequence also contains 15 bp of the 5′ untranslated region, and restriction sites for BamHI in its 5′ terminus and SstI in its extreme 3′ terminus for insertion into pBI121 in place of GUS. An XmaI restriction site occurs 16 bp upstream from the SstI site to allow subsequent insertion of EGFP into the plasmid as an XmaI/SstI fragment.

The sequence underlined above targets N. plumbaginifolia extensin fusion proteins through the ER and Golgi for post-translational modifications, and finally to the wall. The signal sequence proposed also involves transport of extensins and extensin modules in the same plant family (Solanaceae). Alternatively, one can use the signal sequence from tomato P1 extensin itself.

TABLE 1
GFP MUTANTS, WAVELENGTH (nm)

MUTANT                                         Excitation               Emission
mGFPX10; F99S, M153T, V163A                    Excites at 395
mGFPX10-5                                      Excites at 489           Emits at 508
GFPA2; I167T                                   Excites at 471
GFPB7; Y66H                                    Excites at 382           Emits at 440 (blue fluorescence)
GFPX10-C7; F99S, M153T, V163A, I167T, S175G    Excites at 395 and 473
GFPX10-D3; F99S, M153T, V163A, Y66H            Excites at 382           Emits at 440

In yet another alternative, the tomato arabinogalactan-protein (Le-AGP-1) signal sequence may be used. This sequence has previously been cloned [Li (1996) “Isolation and characterization of genes and complementary DNAs encoding a tomato arabinogalactan protein,” Ph.D. Dissertation, Ohio University, Athens, Ohio] and encodes the protein sequence MDRKFVFLVSILCIVVASVTG (SEQ ID NO:215). This sequence has successfully been used by the inventors to target expression of the invention's sequences to the extracellular medium of tobacco cell cultures and is being used to target (Ala-Pro)_(n)-EGFP and (Thr-Pro)_(n)-EGFP to the extracellular matrix of tobacco cell cultures.

Addition of GFP. The repetitive HRGP-modules can be expressed as GFP fusion products rather than GUS fusions, and can also be expressed as modules without GFP. Fusion with a green fluorescent protein reporter gene appropriately red-shifted for plant use, e.g. EGFP (an S65T variant recommended for plants by Clontech) or other suitable mutants (see Table 1 above) allows the detection of <700 GFP molecules at the cell surface. GFP requires aerobic conditions for oxidative formation of the fluorophore. It works well at the lower temperatures used for plant cell cultures and normally it does not adversely affect protein function, although it may allow the regeneration of plants only when targeted to the ER.

Promoters. As noted above, it is not intended that the present invention be limited by the nature of the promoter(s) used in the expression constructs. The CaMV35S promoter is preferred, although it is not entirely constitutive and expression is “moderate”. In some embodiments, higher expression of the constructs is desired to enhance the yield of HRGP modules; in such cases a plasmid with “double” CaMV35S promoters is employed.

D. Selection of Host Cells

A variety of host cells are contemplated (both eukaryotic and prokaryotic). It is not intended that the present invention be limited by the host cells used for expression of the synthetic genes of the present invention. Plant host cells are preferred, including but not limited to legumes (e.g. soy beans) and solanaceous plants (e.g. tobacco, tomato, etc.). Other cells contemplated to be within the scope of this invention are green algae [e.g., Chlamydomonas, Volvox, and duckweed (Lemna)].

The present invention is not limited by the nature of the plant cells. All sources of plant tissue are contemplated. In one embodiment, the plant tissue which is selected as a target for transformation with vectors which are capable of expressing the invention's sequences is capable of regenerating a plant. The term “regeneration” as used herein means growing a whole plant from a plant cell, a group of plant cells, a plant part or a plant piece (e.g., from seed, a protoplast, callus, protocorm-like body, or tissue part). Such tissues include but are not limited to seeds. Seeds of flowering plants consist of an embryo, a seed coat, and stored food. When fully formed, the embryo consists basically of a hypocotyl-root axis bearing either one or two cotyledons and an apical meristem at the shoot apex and at the root apex. The cotyledons of most dicotyledons are fleshy and contain the stored food of the seed. In other dicotyledons and most monocotyledons, food is stored in the endosperm and the cotyledons function to absorb the simpler compounds resulting from the digestion of the food.

Species from the following examples of genera of plants may be regenerated from transformed protoplasts: Fragaria, Lotus, Medicago, Onobrychis, Trifolium, Trigonella, Vigna, Citrus, Linum, Geranium, Manihot, Daucus, Arabidopsis, Brassica, Raphanus, Sinapis, Atropa, Capsicum, Hyoscyamus, Lycopersicon, Nicotiana, Solanum, Petunia, Digitalis, Majorana, Cichorium, Helianthus, Lactuca, Bromus, Asparagus, Antirrhinum, Hemerocallis, Nemesia, Pelargonium, Panicum, Pennisetum, Ranunculus, Senecio, Salpiglossis, Cucumis, Browallia, Glycine, Lolium, Zea, Triticum, Sorghum, and Datura.

For regeneration of transgenic plants from transgenic protoplasts, a suspension of transformed protoplasts or a petri plate containing transformed explants is first provided. Callus tissue is formed and shoots may be induced from callus and subsequently rooted. Alternatively, somatic embryo formation can be induced in the callus tissue. These somatic embryos germinate as natural embryos to form plants. The culture media will generally contain various amino acids and plant hormones, such as auxin and cytokinins. It is also advantageous to add glutamic acid and proline to the medium, especially for such species as corn and alfalfa. Efficient regeneration will depend on the medium, on the genotype, and on the history of the culture. These three variables may be empirically controlled to result in reproducible regeneration.

Plants may also be regenerated from cultured cells or tissues. Dicotyledonous plants which have been shown capable of regeneration from transformed individual cells to obtain transgenic whole plants include, for example, apple (Malus pumila), blackberry (Rubus), blackberry/raspberry hybrid (Rubus), red raspberry (Rubus), carrot (Daucus carota), cauliflower (Brassica oleracea), celery (Apium graveolens), cucumber (Cucumis sativus), eggplant (Solanum melongena), lettuce (Lactuca sativa), potato (Solanum tuberosum), rape (Brassica napus), wild soybean (Glycine canescens), strawberry (Fragaria×ananassa), tomato (Lycopersicon esculentum), walnut (Juglans regia), melon (Cucumis melo), grape (Vitis vinifera), and mango (Mangifera indica). Monocotyledonous plants which have been shown capable of regeneration from transformed individual cells to obtain transgenic whole plants include, for example, rice (Oryza sativa), rye (Secale cereale), and maize.

In addition, regeneration of whole plants from cells (not necessarily transformed) has also been observed in: apricot (Prunus armeniaca), asparagus (Asparagus officinalis), banana (hybrid Musa), bean (Phaseolus vulgaris), cherry (hybrid Prunus), grape (Vitis vinifera), mango (Mangifera indica), melon (Cucumis melo), okra (Abelmoschus esculentus), onion (hybrid Allium), orange (Citrus sinensis), papaya (Carica papaya), peach (Prunus persica), plum (Prunus domestica), pear (Pyrus communis), pineapple (Ananas comosus), watermelon (Citrullus vulgaris), and wheat (Triticum aestivum).

The regenerated plants are transferred to standard soil conditions and cultivated in a conventional manner. After the expression vector is stably incorporated into regenerated transgenic plants, it can be transferred to other plants by vegetative propagation or by sexual crossing. For example, in vegetatively propagated crops, the mature transgenic plants are propagated by the taking of cuttings or by tissue culture techniques to produce multiple identical plants. In seed propagated crops, the mature transgenic plants are self crossed to produce a homozygous inbred plant which is capable of passing the transgene to its progeny by Mendelian inheritance. The inbred plant produces seed containing the nucleic acid sequence of interest. These seeds can be grown to produce plants that would produce the desired polypeptides. The inbred plants can also be used to develop new hybrids by crossing the inbred plant with another inbred plant to produce a hybrid.

It is not intended that the present invention be limited to only certain types of plants. Both monocotyledons and dicotyledons are contemplated. Monocotyledons include grasses, lilies, irises, orchids, cattails, palms, Zea mays (corn), rice, barley, wheat and all grasses. Dicotyledons include almost all the familiar trees and shrubs (other than conifers) and many of the herbs (non-woody plants).

Tomato cultures are the ideal recipients for repetitive HRGP modules to be hydroxylated and glycosylated. Tomato is readily transformed. The cultures produce cell surface HRGPs in high yields easily eluted from the cell surface of intact cells and they possess the required posttranslational enzymes unique to plants: HRGP prolyl hydroxylases, hydroxyproline O-glycosyltransferases and other specific glycosyltransferases for building complex polysaccharide side chains. Furthermore, tomato genetics and tomato leaf disc transformation/plantlet regeneration are well worked out.

Other preferred recipients for the invention's sequences include tobacco cultured cells and plants.

E. Introduction of Nucleic Acid

Expression constructs of the present invention may be introduced into host cells (e.g. plant cells) using methods known in the art. In one embodiment, the expression constructs are introduced into plant cells by particle mediated gene transfer. Particle mediated gene transfer methods are known in the art, are commercially available, and include, but are not limited to, the gas driven gene delivery instrument described in McCabe, U.S. Pat. No. 5,584,807, the entire contents of which are herein incorporated by reference. This method involves coating the nucleic acid sequence of interest onto heavy metal particles, and accelerating the coated particles under the pressure of compressed gas for delivery to the target tissue.

Other particle bombardment methods are also available for the introduction of heterologous nucleic acid sequences into plant cells. Generally, these methods involve depositing the nucleic acid sequence of interest upon the surface of small, dense particles of a material such as gold, platinum, or tungsten. The coated particles are themselves then coated onto either a rigid surface, such as a metal plate, or onto a carrier sheet made of a fragile material such as mylar. The coated sheet is then accelerated toward the target biological tissue. The use of the flat sheet generates a uniform spread of accelerated particles which maximizes the number of cells receiving particles under uniform conditions, resulting in the introduction of the nucleic acid sample into the target tissue.

Alternatively, an expression construct may be inserted into the genome of plant cells by infecting them with a bacterium, including but not limited to an Agrobacterium strain previously transformed with the nucleic acid sequence of interest. Generally, disarmed Agrobacterium cells are transformed with recombinant Ti plasmids of Agrobacterium tumefaciens or Ri plasmids of Agrobacterium rhizogenes (such as those described in U.S. Pat. No. 4,940,838, the entire contents of which are herein incorporated by reference) which are constructed to contain the nucleic acid sequence of interest using methods well known in the art (Sambrook, J. et al., (1989) supra). The nucleic acid sequence of interest is then stably integrated into the plant genome by infection with the transformed Agrobacterium strain. For example, heterologous nucleic acid sequences have been introduced into plant tissues using the natural DNA transfer system of Agrobacterium tumefaciens and Agrobacterium rhizogenes bacteria (for review, see Klee et al. (1987) Ann. Rev. Plant Phys. 38:467-486).

One of skill in the art knows that the efficiency of transformation by Agrobacterium may be enhanced by using a number of methods known in the art. For example, the inclusion of a natural wound response molecule such as acetosyringone (AS) in the Agrobacterium culture has been shown to enhance transformation efficiency with Agrobacterium tumefaciens [Shahla et al. (1987) Plant Molec. Biol. 8:291-298]. Alternatively, transformation efficiency may be enhanced by wounding the target tissue to be transformed. Wounding of plant tissue may be achieved, for example, by punching, maceration, bombardment with microprojectiles, etc. [see, e.g., Bidney et al. (1992) Plant Molec. Biol. 18:301-313].

It may be desirable to target the nucleic acid sequence of interest to a particular locus on the plant genome. Site-directed integration of the nucleic acid sequence of interest into the plant cell genome may be achieved by, for example, homologous recombination using Agrobacterium-derived sequences. Generally, plant cells are incubated with a strain of Agrobacterium which contains a targeting vector in which sequences that are homologous to a DNA sequence inside the target locus are flanked by Agrobacterium transfer-DNA (T-DNA) sequences, as previously described (Offringa et al., (1996), U.S. Pat. No. 5,501,967, the entire contents of which are herein incorporated by reference). One of skill in the art knows that homologous recombination may be achieved using targeting vectors which contain sequences that are homologous to any part of the targeted plant gene, whether belonging to the regulatory elements of the gene, or the coding regions of the gene. Homologous recombination may be achieved at any region of a plant gene so long as the nucleic acid sequence of regions flanking the site to be targeted is known.

Where homologous recombination is desired, the targeting vector used may be of the replacement- or insertion-type (Offringa et al. (1996), supra). Replacement-type vectors generally contain two regions which are homologous with the targeted genomic sequence and which flank a heterologous nucleic acid sequence, e.g., a selectable marker gene sequence. Replacement-type vectors result in the insertion of the selectable marker gene which thereby disrupts the targeted gene. Insertion-type vectors contain a single region of homology with the targeted gene and result in the insertion of the entire targeting vector into the targeted gene.

Other methods are also available for the introduction of expression constructs into plant tissue, e.g., electroinjection (Nan et al. (1995) In “Biotechnology in Agriculture and Forestry,” Ed. Y. P. S. Bajaj, Springer-Verlag Berlin Heidelberg, Vol 34:145-155; Griesbach (1992) HortScience 27:620); fusion with liposomes, lysosomes, cells, minicells or other fusible lipid-surfaced bodies (Fraley et al. (1982) Proc. Natl. Acad. Sci. USA 79:1859-1863); polyethylene glycol (Krens et al. (1982) Nature 296:72-74); chemicals that increase free DNA uptake; transformation using virus, and the like.

In one embodiment, the present invention contemplates introducing nucleic acid via the leaf disc transformation method. Horsch et al. Science 227:1229-1231 (1985). Briefly, disks are punched from the surface of sterilized leaves and submerged with gentle shaking into a culture of A. tumefaciens that has been grown overnight in Luria Broth (LB) at 28° C. The disks are then blotted dry and placed upside-down onto nurse culture plates to induce the regeneration of shoots. After 2-3 days, the leaf disks are transferred to petri plates containing the same media without feeder cells or filter papers, but in the presence of carbenicillin (500 μg/ml) and kanamycin (300 μg/ml) to select for antibiotic resistance. After 2-4 weeks, the shoots that have developed are removed from calli and placed into root-inducing media with the appropriate antibiotic. These shoots are then transplanted into soil once roots have formed.

Cells and tissues which are transformed with a heterologous nucleic acid sequence of interest are readily detected using methods known in the art including, but not limited to, restriction mapping of the genomic DNA, PCR-analysis, DNA-DNA hybridization, DNA-RNA hybridization, DNA sequence analysis and the like.

Additionally, selection of transformed cells may be accomplished using a selection marker gene. It is preferred, though not necessary, that a selection marker gene be used to select transformed plant cells. A selection marker gene may confer positive or negative selection.

A positive selection marker gene may be used in constructs for random integration and site-directed integration. Positive selection marker genes include antibiotic resistance genes, herbicide resistance genes and the like. In one embodiment, the positive selection marker gene is the NPTII gene which confers resistance to geneticin (G418) or kanamycin. In another embodiment the positive selection marker gene is the HPT gene which confers resistance to hygromycin. The choice of the positive selection marker gene is not critical to the invention as long as it encodes a functional polypeptide product. Positive selection genes known in the art include, but are not limited to, the ALS gene (chlorsulphuron resistance), and the DHFR gene (methotrexate resistance).

A negative selection marker gene may also be included in the constructs. The use of one or more negative selection marker genes in combination with a positive selection marker gene is preferred in constructs used for homologous recombination. Negative selection marker genes are generally placed outside the regions involved in the homologous recombination event. The negative selection marker gene serves to provide a disadvantage (preferably lethality) to cells that have integrated these genes into their genome in an expressible manner. Cells in which the targeting vectors for homologous recombination are randomly integrated in the genome will be harmed or killed due to the presence of the negative selection marker gene. Where a positive selection marker gene is included in the construct, only those cells having the positive selection marker gene integrated in their genome will survive.
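The combined positive/negative selection logic described above can be summarized in a small Python sketch. It is an idealized truth table under the assumption that both selections are fully effective (the text describes negative selection as a disadvantage, preferably lethality); the function name `survives_selection` and the scenario labels are illustrative.

```python
def survives_selection(has_positive_marker, has_negative_marker):
    """Idealized outcome under combined selection: a cell survives only if it
    carries the positive marker and has not integrated the negative marker."""
    return has_positive_marker and not has_negative_marker


# Homologous recombination integrates only the region between the homology
# arms, so the negative marker (placed outside them) is left behind. Random
# integration of the whole targeting vector carries the negative marker along.
outcomes = {
    "homologous recombination": survives_selection(True, False),
    "random integration of whole vector": survives_selection(True, True),
    "no integration": survives_selection(False, False),
}

for scenario, alive in outcomes.items():
    print(f"{scenario}: {'survives' if alive else 'eliminated'}")
```

In this idealized picture only the targeted events pass both selections, which is why the two marker types are preferably combined in constructs used for homologous recombination.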

The choice of the negative selection marker gene is not critical to the invention as long as it encodes a functional polypeptide in the transformed plant cell. The negative selection gene may for instance be chosen from the aux-2 gene from the Ti-plasmid of Agrobacterium, the tk-gene from SV40, cytochrome P450 from Streptomyces griseolus, the Adh-gene from maize or Arabidopsis, etc. Any gene encoding an enzyme capable of converting a substance which is otherwise harmless to plant cells into a substance which is harmful to plant cells may be used.

It is not intended that the host cells which are transformed with the invention's sequences or with expression constructs containing these sequences be limited to cells which display any particular phenotype. All that is necessary is that the transformed cells express a polypeptide encoded by the invention's sequences. Such host cells may be used to purify the expressed polypeptides for subsequent use (e.g., in the food or cosmetic industry, for isolating HRGP-binding molecules, and for making antibodies).

Nor is the invention intended to be limited to transformed cells which express the invention's nucleotide sequences at a particular level, a particular time during the cell's life cycle, or a particular part of a transformed plant. Rather, the invention expressly contemplates cells which express relatively low and relatively high levels of the desired proteins, regardless of whether such expression occurs in some or all parts of the transformed plant, and whether it changes or is unchanged in level during cell growth or plant development.

F. Preferred Consensus Sequences and Portions Thereof

The present invention provides GAGP sequences, and in particular the consensus sequence of SEQ ID NO:134. Gum arabic glycoprotein (GAGP) is a large molecular weight, hydroxyproline-rich arabinogalactan-protein (AGP) component of gum arabic. GAGP has a simple, highly biased amino acid composition indicating a repetitive polypeptide backbone. It has been suggested that the polypeptide backbone contains small (~10 amino acid residues) repetitive peptide motifs each with three Hyp-arabinoside attachment sites and a single Hyp-arabinogalactan polysaccharide attachment site [Qi et al. (1991) supra]. The inventors have tested this hypothesis by generating and sequencing peptides of GAGP, and determining the glycosyl and linkage analysis of an isolated Hyp-polysaccharide. Surprisingly, the inventors discovered a 19-amino acid consensus sequence, which is roughly twice the size of that previously postulated by Qi et al. (1991). In addition to the difference in size of the repeating motif, the inventors also surprisingly discovered that the peptides in the invention's 19-amino acid consensus sequence lacked some of the amino acids present in the empirical formula [i.e., Hyp₄ Ser₂ Thr Pro Gly Leu His (SEQ ID NO:135)] of the repeat motif suggested by Qi et al. [Qi et al. (1991) supra], most notably His (Table 6, peptide PH3G2). The inventors also surprisingly discovered that the invention's 19-amino acid GAGP consensus motif contains approximately nine Hyp residues, with only a single polysaccharide attachment site. Judging from the Hyp-glycoside profile of GAGP, the invention's consensus motif contained six Hyp-arabinosides rather than Qi et al.'s three, and two Hyp-polysaccharides rather than Qi et al.'s one.

The invention provides the consensus sequence (SEQ ID NO:136): A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M, wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro (Example 18, e.g., Tables 3 and 6). Also included within the scope of the invention are portions of the consensus sequence, having from 4 to 19 contiguous amino acid residues of the consensus sequence.
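The degenerate consensus of SEQ ID NO:136 can be written as one set of allowed residues per position, which makes it straightforward to test whether a peptide, or a portion of 4 to 19 residues, fits it. The Python sketch below is illustrative only: the names `CONSENSUS_136`, `fits` and `matching_offsets` are assumptions, and a positional match says nothing about claim scope or glycosylation.

```python
# Allowed residues at each of the 19 positions of SEQ ID NO:136,
# transcribed from the definitions of A through M above.
CONSENSUS_136 = [
    {"Ser", "Thr", "Ala"},                       # A
    {"Hyp"},
    {"Hyp", "Pro", "Leu", "Ile"},                # B
    {"Pro", "Hyp"},                              # C
    {"Hyp", "Pro", "Ser", "Thr", "Ala"},         # D
    {"Leu", "Ile"},                              # E
    {"Ser", "Thr", "Ala"},                       # F
    {"Hyp"},
    {"Ser", "Leu", "Hyp", "Thr", "Ala", "Ile"},  # G
    {"Hyp", "Pro", "Leu", "Ile"},                # H
    {"Thr", "Ala", "Ser"},                       # I
    {"Hyp"},
    {"Thr", "Ser", "Ala"},                       # J
    {"Hyp"},
    {"Hyp"},
    {"Thr", "Leu", "Hyp", "Ser", "Ala", "Ile"},  # K
    {"Gly", "Leu", "Ala", "Ile"},                # L
    {"Pro"},
    {"His", "Pro"},                              # M
]


def fits(residues, offset=0):
    """True if residues match the consensus starting at the given offset."""
    return all(res in CONSENSUS_136[offset + k] for k, res in enumerate(residues))


def matching_offsets(residues):
    """Return every 0-based offset at which a 4 to 19 residue peptide aligns
    with a contiguous stretch of the consensus; empty list if none."""
    if not 4 <= len(residues) <= len(CONSENSUS_136):
        return []
    last_start = len(CONSENSUS_136) - len(residues)
    return [o for o in range(last_start + 1) if fits(residues, o)]


full = ("Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-"
        "Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His").split("-")
delonnay = "Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro".split("-")

print(matching_offsets(full))      # full-length repeat
print(matching_offsets(delonnay))  # 12-residue portion (SEQ ID NO:13)
```

Here the full-length repeat aligns at the start of the consensus, while the twelve-residue Delonnay peptide aligns as an internal portion beginning at position F.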

In a preferred embodiment, the invention's GAGP consensus sequencecontains 19 amino acids, of which approximately nine are Hyp residues.Judging from the Hyp-glycoside profile of GAGP (Table 7) about one inevery five Hyp residues is polysaccharide-substituted. Thus, in onepreferred embodiment, there are approximately two Hyp-polysaccharidesites in the consensus sequence and portions thereof. Without limitingthe invention to any particular mechanism, the inventors predictedarabinosylation of contiguous Hyp residues andarabinogalactan-polysaccharide addition to clustered non-contiguous Hypresidues, such as the X-Hyp-X-Hyp modules common in AGPs [Nothnagel(1997) International Review of Cytology 174:195]. Also without limitingthe invention to a particular theory, the inventors are of the view thatthe inventions's 19-amino acid consensus motif preferably containsapproximately two polysaccharide attachment sites in the clusterednon-contiguous Hyp motif [F-Hyp-G-H-I-Hyp (SEQ ID NO:137), where F isselected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr,Ala, and Ile; His selected from Hyp, Pro, Leu, and Ile; and I isselected from Thr, Ala, and Ser] which is exemplified bySer-Hyp-Ser-Hyp-Thr-Hyp (SEQ ID NO:138)], and which is flanked byarabinosylated contiguous Hyp residues such as A-Hyp-B-C-D-E (SEQ IDNO:139) where A is selected from Ser, Thr, and Ala; B is selected fromHyp, Leu, and Ile; C is selected from Pro and Hyp; D is selected fromHyp, Ser, Thr, and Ala; E is selected from Leu and Ile; and morepreferably Ser-Hyp-Hyp-Hyp-(Hyp/Thr/Ser)-Leu (SEQ ID NO:140), and suchas J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:141) where J is selected from Thr,Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L isselected from Gly, Leu, Ala, and Ile; and M is selected from His, andPro; and more preferablyHyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-(Hyp/Leu)-Gly-Pro-His (SEQ ID NOs:142)(FIG. 9). 
The following Table 2 shows 45 illustrative sequences which have from 4 to 19 amino acids and which are encompassed by the invention's SEQ ID NO:136.

TABLE 2
Exemplary Sequences*

Motif Number   Motif Sequence
1    Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 143)
2    Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO: 144)
3    Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Thr-Gly-Pro-His (SEQ ID NO: 145)
4    Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-Hyp (SEQ ID NO: 146)
5    Ser-Hyp-Leu-Pro-Thr-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO: 147)
6    Ser-Hyp-Leu-Pro-Thr-Leu-Ser-Hyp-Leu-Pro-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO: 148)
7    Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-Hyp (SEQ ID NO: 149)
8    Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO: 150)
9    Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO: 151)
10   Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO: 152)
11   Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 153)
12   Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 154)
13   Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Leu-Pro-His (SEQ ID NO: 155)
14   Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Leu-Pro (SEQ ID NO: 156)
15   Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu (SEQ ID NO: 157)
16   Hyp-Hyp-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 158)
17   Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp (SEQ ID NO: 159)
18   Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-Hyp (SEQ ID NO: 160)
19   Hyp-Thr-Leu-Ser-Hyp-Leu-Pro-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly (SEQ ID NO: 161)
20   Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp (SEQ ID NO: 162)
21   Ser-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Thr (SEQ ID NO: 163)
22   Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp (SEQ ID NO: 164)
23   Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 165)
24   Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp (SEQ ID NO: 166)
25   Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro (SEQ ID NO: 167)
26   Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 168)
27   Hyp-Leu-Ser-Hyp-Ser-Hyp-Ala-Hyp (SEQ ID NO: 169)
28   Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser (SEQ ID NO: 170)
29   Thr-Hyp-Hyp-Hyp-Gly-Pro (SEQ ID NO: 171)
30   Hyp-Hyp-Leu-Ser-Hyp-Ser (SEQ ID NO: 172)
31   Ser-Hyp-Leu-Pro-Ala-Hyp (SEQ ID NO: 173)
32   Leu-Pro-Thr-Leu-Ser-Hyp (SEQ ID NO: 174)
33   Ser-Hyp-Ser-Hyp (SEQ ID NO: 175)
34   Ser-Hyp-Thr-Hyp (SEQ ID NO: 176)
35   Thr-Hyp-Thr-Hyp (SEQ ID NO: 177)
36   Thr-Hyp-Hyp-Hyp (SEQ ID NO: 178)
37   Ser-Hyp-Pro-Pro-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 217)
38   Ser-Hyp-Hyp-Pro-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 218)
39   Ser-Hyp-Pro-Hyp-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 219)
40   Ser-Hyp-Pro-Pro-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 220)
41   Ser-Hyp-Hyp-Hyp-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 221)
42   Ser-Hyp-Hyp-Pro-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 222)
43   Ser-Hyp-Pro-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 223)
44   Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 224)
45   Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 225)

*It is preferred, for gene design, that the last three amino acid residues (e.g., Gly-Pro-Xaa) be moved from the end to the front of the DNA sequence. Most of the Pro residues will be post-translationally modified to Hyp and glycosylated when expressed in plants; Hyp glycosylation is crucial for function. This table does not list every variation that can be derived from the consensus sequence.

In one preferred embodiment, the consensus sequence and portions thereof are selected from Ser-Hyp-Hyp-Hyp-A-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-B-Gly-Pro-His (SEQ ID NO:179), where A is selected from Hyp, Thr and Ser, and B is selected from Hyp and Leu (Table 6). Remarkably, fifteen amino acid residues of this sequence are “quasi-palindromic,” i.e., the side chain sequence is almost the same whether read from the N-terminus or C-terminus. Without limiting the invention to a particular theory or mechanism, it is the inventors' consideration that such peptide symmetry, which occurs frequently in extensins and AGPs, may enhance molecular packing, recognition, and self-assembly. Indeed, palindromic symmetry rigidified by contiguous Hyp in the motifs Ser-Hyp-Hyp-Hyp-(Hyp) and Thr-Hyp-Hyp-(Hyp) may impart self-ordering properties in GAGP and other HRGPs. Thus, it is the inventors' consideration that GAGP properties are related to the polysaccharide substituents. In particular, the repeating glycopeptide symmetry of two central polysaccharides flanked by Hyp arabinosides may enhance gum arabic's remarkable properties, which include: an anomalously low viscosity [Churms et al. (1983) Carbohydrate Research 123:267], the ability to act as a flavor emulsifier and stabilizer, and GAGP's biological role as a component of a plastic sealant.

In one embodiment, the invention's sequences and portions thereof may be used as repeats. The repeats preferably range from 1 to 500, more preferably from 1 to 100 and most preferably from 1 to 10. Data disclosed herein demonstrate the production of 8, 16, 20, 32, and 64 repeats of gum arabic motifs (Example 19).

The repeats may be contiguous or noncontiguous. Contiguous repeats are those without intervening amino acids, or amino acid analogues, placed between the repeating sequences. The repeats may contain two or more sequences which are described by the consensus sequence (SEQ ID NO:136) and portions thereof. The two or more sequences may be the same or different. An example of a single repeat in which the two 19-amino acid sequences are different is motif 1-motif 2 [motif 1 (SEQ ID NO:143) = Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His; motif 2 (SEQ ID NO:144) = Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His], described below in Example 19. Another example of a single repeat in which the two 19-amino acid sequences are different is motif 7-motif 13 of Table 2, having the sequence (SEQ ID NO:180): Gly-Pro-Hyp-Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Leu-Pro-His-Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu, in which motif 13 is underlined, and is flanked by motif 7. Yet another example of a single repeat in which the two 19-amino acid sequences are different is Table 2's motif 10-motif 12, having the sequence (SEQ ID NO:181): Gly-Pro-His-Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His-Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu, in which motif 10 is underlined and is flanked by motif 12. Examples of a single repeat in which the two 19-amino acid sequences are the same are (motif 1-motif 1), (motif 2-motif 2), (motif 3-motif 3), etc.
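For illustration, SEQ ID NO:180 appears consistent with joining motif 13 and motif 7 of Table 2 and then moving the last three residues to the front, as the footnote to Table 2 prefers for gene design. The sketch below reproduces that arrangement under this reading; the helper name `rotated_repeat` and its parameter `k` are hypothetical and are not taken from the disclosure.

```python
MOTIF_7 = ("Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Leu-"
           "Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-Hyp")   # SEQ ID NO:149
MOTIF_13 = ("Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-"
            "Thr-Hyp-Thr-Hyp-Hyp-Leu-Leu-Pro-His")  # SEQ ID NO:155


def rotated_repeat(motifs, k=3):
    """Join motifs head to tail, then move the last k residues to the front."""
    residues = [res for motif in motifs for res in motif.split("-")]
    return "-".join(residues[-k:] + residues[:-k])


if __name__ == "__main__":
    # Motif 13 followed by motif 7, with Gly-Pro-Hyp moved to the front.
    print(rotated_repeat([MOTIF_13, MOTIF_7]))
```

Under the same assumption, `rotated_repeat` applied to motif 10 followed by motif 12 gives the arrangement shown for SEQ ID NO:181.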

In an alternative embodiment, the invention's sequences and portions thereof are used as noncontiguous repeats, i.e., with from 1 to 1000, more preferably from 1 to 100, and even more preferably from 1 to 10, intervening amino acids, or amino acid analogues, placed between the repeating sequences. The term “amino acid analog” refers to an amino acid that is chemically modified. Illustrative of such modifications would be replacement of hydrogen by an alkyl, acyl, or amino group, or formation of covalent adducts with biotin or fluorescent groups. Amino acids include biological amino acids as well as non-biological amino acids. The term “biological amino acid” refers to any one of the known 20 coded amino acids that a cell is capable of introducing into a polypeptide translated from an mRNA. The term “non-biological amino acid” refers to an amino acid that is not a biological amino acid. Non-biological amino acids are useful, for example, because of their stereochemistry or their chemical properties. The non-biological amino acid norleucine, for example, has a side chain similar in shape to that of methionine. However, because it lacks a side chain sulfur atom, norleucine is less susceptible to oxidation than methionine. Other examples of non-biological amino acids include aminobutyric acids, norvaline and allo-isoleucine, which contain hydrophobic side chains with different steric properties as compared to biological amino acids. The term “derivative” when in reference to an amino acid sequence means that the amino acid sequence contains at least one amino acid analog.

The production of repeating sequences may be achieved using methods known in the art [for example, Lewis et al. (1996) Protein Expression & Purification 7:400-406] and the methods described herein (Example 19).

In a preferred embodiment, the consensus sequence and portions thereof contain at least one noncontiguous hydroxyproline sequence and/or at least one contiguous hydroxyproline sequence. In a more preferred embodiment, the consensus sequence and portions thereof contain at least one noncontiguous hydroxyproline sequence and at least one contiguous hydroxyproline sequence.

The term “noncontiguous hydroxyproline sequence” refers to a sequence selected from (Xaa-Hyp)_(x) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa, wherein Xaa is any amino acid other than hydroxyproline, and wherein x is from 2 to 1000, more preferably from 2 to 100, and most preferably from 2 to 50. In a preferred embodiment, the noncontiguous hydroxyproline sequence is Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9), wherein Xaa is selected from Ser, Thr, and Ala.

The term “contiguous hydroxyproline sequence” refers to a sequence selected from Xaa-Hyp-Hyp_(n) (SEQ ID NO:209) and Xaa-Pro-Hyp_(n) (SEQ ID NO:210), wherein n is from 1 to 100, and wherein Xaa is any amino acid other than hydroxyproline. In a preferred embodiment, the contiguous hydroxyproline sequence is selected from Ser-Hyp₂ (SEQ ID NO:211), Ser-Hyp₃ (SEQ ID NO:212), Ser-Hyp₄ (SEQ ID NO:3), Thr-Hyp₂ (SEQ ID NO:213), and Thr-Hyp₃ (SEQ ID NO:214).

Data presented herein demonstrate that noncontiguous hydroxyproline sequences [e.g., (Xaa-Hyp)_(x) where x is preferably at least 2] are functional glycomodules which direct the exclusive addition of arabinogalactan polysaccharide to Hyp, while contiguous hydroxyproline sequences are functional glycomodules which direct arabinosylation (Example 23). The term “functional” when made in reference to a noncontiguous hydroxyproline sequence or to sequences containing a noncontiguous hydroxyproline sequence means that the sequence directs exclusive addition of arabinogalactan polysaccharide to Hyp residues in that sequence. The addition of arabinogalactan polysaccharide to Hyp residues may be determined using methods described herein (Example 23). The term “functional” when made in reference to a contiguous hydroxyproline sequence or to sequences containing a contiguous hydroxyproline sequence means that the sequence directs arabinosylation of Hyp residues in that sequence as determined by methods disclosed herein (Example 23).
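The two glycomodule classes described above can be illustrated by a deliberately simplified classifier. It is a sketch under stated assumptions, not a predictive tool: a Hyp with an adjacent Hyp is called contiguous, a Hyp with another Hyp two positions away is called clustered noncontiguous, and the Xaa-Pro-Hyp_(n) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa forms are not handled. The function name and labels are hypothetical.

```python
ARABINOSYLATION = "arabinosylation"
POLYSACCHARIDE = "arabinogalactan polysaccharide"
UNASSIGNED = "unassigned"


def classify_hyp(seq: str) -> dict:
    """Map each Hyp position (1-based) to the modification suggested by its context.

    Simplifying assumptions (illustrative only):
      - Hyp next to another Hyp -> contiguous module -> arabinosylation
      - otherwise, Hyp two residues from another Hyp (Xaa-Hyp-Xaa-Hyp)
        -> clustered noncontiguous module -> polysaccharide addition
      - anything else is left unassigned
    """
    residues = seq.split("-")

    def is_hyp(i: int) -> bool:
        return 0 <= i < len(residues) and residues[i] == "Hyp"

    calls = {}
    for i, res in enumerate(residues):
        if res != "Hyp":
            continue
        if is_hyp(i - 1) or is_hyp(i + 1):
            calls[i + 1] = ARABINOSYLATION
        elif is_hyp(i - 2) or is_hyp(i + 2):
            calls[i + 1] = POLYSACCHARIDE
        else:
            calls[i + 1] = UNASSIGNED
    return calls


if __name__ == "__main__":
    print(classify_hyp("Ser-Hyp-Ser-Hyp"))          # noncontiguous module
    print(classify_hyp("Ser-Hyp-Hyp-Hyp-Hyp-Leu"))  # contiguous module
```

Applied to a repetitive Ser-Hyp sequence, every Hyp falls in the noncontiguous class, which matches the exclusive polysaccharide addition described in the text; site occupancy in mixed sequences is an experimental question that this sketch does not address.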

The invention contemplates sequences that are complementary, and partially complementary, to SEQ ID NO:136 and portions thereof, such as those which hybridize under low stringency conditions and high stringency conditions to these sequences.

The sequences of the invention may be used to isolate hydroxyproline-rich glycoprotein-binding molecules and to make polyclonal and monoclonal antibodies as described supra. In addition, the invention's sequences may be used as emulsifying agents and/or to stabilize emulsions, both of which are properties of GAGP that are highly valued by the food industry. The emulsifying and emulsion stabilizing activities of the invention's proteins, glycoproteins, and portions thereof may be determined using generic methods known in the art [Kevin & John (1978) J. Agric. Food Chem 26(3):716-723; James & Patel, “Development of a standard oil-in-water emulsification test for proteins,” Leatherhead Food RA Res. Rep. No. 631] which employ commercially available reagents.

For example, the following assay may be employed using orange oil (Sigma) following essentially the manufacturer's instructions. Freeze-dried glycoproteins are dissolved in 0.05 M phosphate buffer (pH 6.5) at a concentration of 0.5% (m/v). The aqueous solutions are combined with orange oil in a 60:40 (v/v) ratio. A 1 ml emulsion is prepared in a glass tube at 0° C. with a Sonic Dismembrator (Fisher Scientific) equipped with a Microtip probe. The amplitude value is set at 4 and mixing time is set to 1 min. For the determination of emulsifying ability (EA), the emulsion is diluted serially with a solution containing 0.1 M NaCl and 0.1% SDS to give a final dilution of 1/1500. The optical density of the diluted emulsion is then determined in a 1-cm pathlength cuvette at a wavelength of 50 nm and defined as EA. The emulsion is stored vertically in a glass tube for 3 h at room temperature, then the optical density of a 1:1500 dilution of the lower phase of the stored sample is measured. Emulsifying stability (ES) is defined as the percentage optical density remaining after 2 hours of storage. BSA is used as a positive control. This assay has been used to determine the activity of sequences within the scope of the invention, as described in Example 24.
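As a worked illustration of the two definitions above, EA is simply the optical density of the diluted fresh emulsion, and ES is the optical density remaining after storage expressed as a percentage of that value. The function name and the example readings below are hypothetical and are not data from the disclosure.

```python
def emulsifying_stability(od_fresh: float, od_stored: float) -> float:
    """Return ES, the percentage of optical density remaining after storage.

    od_fresh  : optical density of the diluted fresh emulsion (this value is EA)
    od_stored : optical density of the diluted lower phase after storage
    """
    if od_fresh <= 0:
        raise ValueError("od_fresh must be positive")
    return 100.0 * od_stored / od_fresh


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    ea = 0.5
    es = emulsifying_stability(ea, 0.25)
    print(f"EA = {ea}, ES = {es}%")
```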

G. O-Glycosylation Codes

The invention further provides sequences which signal O-glycosylation. The O-glycosylation sequences are the noncontiguous hydroxyproline sequences (Xaa-Hyp)_(x) (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein Xaa is any amino acid other than hydroxyproline, and wherein x is a number from 1 to 1000, more preferably from 2 to 100, and yet more preferably from 2 to 50. In a more preferred embodiment, the sequence is Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9), wherein Xaa is selected from Ser, Thr, and Ala.

The inventors' discovery of these sequences was based on their hypothesis that clustered, non-contiguous Hyp residues are sites for arabinogalactan polysaccharide attachment. In particular, the inventors predicted that Hyp galactosylation of clustered non-contiguous Hyp residues, such as the Xaa-Hyp-Xaa-Hyp repeats of AGPs, results in the addition of a galactan core with side chains of arabinose and other sugars to form characteristic Hyp-arabinogalactan polysaccharides. Hitherto, these sites of arabinogalactan polysaccharide attachment have been poorly defined because AGPs resist proteases, and because degradation by partial alkaline hydrolysis yields arabinogalactan-glycopeptides that are difficult to purify.

The inventors' discovery of the O-glycosylation sequences relied on a new approach to HRGP glycosylation site mapping as disclosed herein. To test their hypothesis that non-contiguous Hyp residues are sites for arabinogalactan polysaccharide attachment, the inventors designed three synthetic genes. The first synthetic gene, dubbed Sig-(Ser-Pro)₃₂-EGFP, encoded a signal sequence (Sig) at the N-terminus followed by a repetitive Ser-Hyp motif [i.e., (Ser-Pro)₃₂] which encoded only clustered non-contiguous Hyp residues, which the inventors predicted would code as polysaccharide addition sites. The (Ser-Pro)₃₂ was followed by EGFP at the C-terminus (FIG. 11). The inventors predicted that polysaccharide addition to noncontiguous Hyp should yield an expression product containing Hyp-polysaccharide exclusively. The second synthetic gene, dubbed Sig-(GAGP)₃-EGFP, encoded three repeats of a slightly modified 19-amino acid residue GAGP consensus sequence (FIG. 14) and was used by the inventors to determine whether it yielded an expression product that contains Hyp arabinosides as well as Hyp-polysaccharide. The third synthetic gene was a control construct (Sig-EGFP) that encoded only the signal sequence and EGFP. The expression product was a control to test whether or not any Hyp glycosylation could be attributed to EGFP modification that encode putative AGP glycomodules. Data presented herein show that, when expressed and targeted for secretion, the two experimental sequence modules behaved as simple endogenous substrates for HRGP glycosyltransferases. The first construct, expressing noncontiguous Hyp, showed exclusive polysaccharide addition with polysaccharide O-linked to all Hyp residues. In contrast, the second construct, containing noncontiguous Hyp and additional contiguous Hyp, showed both polysaccharide and arabinooligosaccharide. From these data, the inventors arrived at the invention's O-glycosylation sequences.

The invention's sequences find use as substrates for O-Hyp arabinosyl- and galactosyltransferases. These substrates may be used to isolate and unambiguously identify these enzymes as well as to determine the enzymes' substrate preferences.

Yet another use for the invention's sequences is in the identification of potential sites of oligoarabinoside addition in HRGPs, which may be inferred from their genomic sequences. Furthermore, these sequences would permit the transfer of useful products like exudate gum glycoproteins [Breton et al. (1998) J. Biochem. (Tokyo) 123, 1000-1009; Islam et al. (1997) Food Hydrocolloids 11, 493-505] such as GAGP from thorny desert scrub like Acacia to other desirable crop plants.

A further use for the invention's sequences is that they facilitate the de novo design of new HRGPs and their manipulation to enhance desirable properties. For example, glycoproteins which contain the O-glycosylation sequences of the invention may be used as emulsifying agents and/or to stabilize emulsions, as described supra as well as in Example 24.

EXPERIMENTAL

The following examples serve to illustrate certain preferred embodiments and aspects of the present invention and are not to be construed as limiting the scope thereof.

In the experimental disclosure which follows, the following abbreviations apply: g (gram); mg (milligrams); μg (microgram); M (molar); mM (milliMolar); μM (microMolar); nm (nanometers); L (liter); ml (milliliter); μl (microliters); ° C. (degrees Centigrade); m (meter); sec. (second); DNA (deoxyribonucleic acid); cDNA (complementary DNA); RNA (ribonucleic acid); mRNA (messenger ribonucleic acid); X-gal (5-bromo-4-chloro-3-indolyl-β-D-galactopyranoside); LB (Luria Broth); PAGE (polyacrylamide gel electrophoresis); NAA (α-naphthaleneacetic acid); BAP (6-benzyl aminopurine); Tris (tris(hydroxymethyl)-aminomethane); PBS (phosphate buffered saline); 2×SSC (0.3 M NaCl, 0.03 M Na₃citrate, pH 7.0); Agri-Bio Inc. (North Miami, Fla.); Analytical Scientific Instruments (Alameda, Calif.); BioRad (Richmond, Calif.); Clontech (Palo Alto, Calif.); Delmonte Fresh Produce (Kunia, Hi.); Difco Laboratories (Detroit, Mich.); Dole Fresh Fruit (Wahiawa, Hi.); Dynatech Laboratory Inc. (Chantilly, Va.); Gibco BRL (Gaithersburg, Md.); Gold Bio Technology, Inc. (St. Louis, Mo.); GTE Corp. (Danvers, Mass.); MSI Corp. (Micron Separations, Inc., Westboro, Mass.); Operon (Operon Technologies, Alameda, Calif.); Pioneer Hi-Bred International, Inc. (Johnston, Iowa); 5 Prime 3 Prime (Boulder, Colo.); Sigma (St. Louis, Mo.); Promega (Promega Corp., Madison, Wis.); Stratagene (Stratagene Cloning Systems, La Jolla, Calif.); USB (U.S. Biochemical, Cleveland, Ohio).

Example 1 Determination of the Peptide Sequence of Acacia Gum Arabic Glycoproteins

In this example, GAGP (SEQ ID NO:15) was isolated and (by using chymotrypsin) the deglycosylated polypeptide backbone was prepared. Although GAGP does not contain the usual chymotryptic cleavage sites, it does contain leucyl and histidyl residues which are occasionally cleaved. Chymotrypsin cleaved a sufficient number of these “occasionally cleaved” sites to produce a peptide map of closely related peptides.

Purification and Deglycosylation of GAGP (SEQ ID NO:15). GAGP was isolated via preparative Superose-6 gel filtration. Anhydrous hydrogen fluoride deglycosylated it (20 mg powder/mL HF at 4° C., repeating the procedure twice to ensure complete deglycosylation), yielding dGAGP which gave a single symmetrical peak (data not shown) after re-chromatography on Superose-6. Further purification of dGAGP by reverse phase chromatography also gave a single major peak, showing a highly biased but constant amino acid composition in fractions sampled across the peak. These data indicated that dGAGP was a single polypeptide component sufficiently pure for sequence analysis.

Sequence Analysis. An incomplete pronase digest gave a large peptide PRP3 which yielded a partial sequence (Table 3) containing all the amino acids present in the suggested dGAGP repeat motif. In view of the limitations of pronase, for further peptide mapping and to obtain more definitive sequence information, dGAGP was digested with chymotrypsin, followed by a two-stage HPLC fractionation scheme. Initial separation of the chymotryptides on a PolySULFOETHYL A™ (designated PSA, PolyLC, Inc. Ellicott City, Md.) cation exchanger yielded three major fractions: S1 and S2 increased with digestion time while S3 showed a concomitant decrease. Further chromatography on PRP-1 resolved PSA fractions S1 and S2 into several peptides.

TABLE 3
AMINO ACID SEQUENCES OF THE GUM ARABIC GLYCOPROTEIN POLYPEPTIDE BACKBONE

Peptide   Sequence
S1P5    Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-(Pro) (SEQ ID NO: 16)
S1P3    Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-(Pro) (SEQ ID NO: 17)
S3      Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His-Ser-Hyp-Hyp-Hyp-(Hyp) (SEQ ID NO: 18)
S1P2    Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Thr-Gly-Pro-His (SEQ ID NO: 19)
S2P1    Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO: 20)
S2P2a   Ser-Hyp-Ser-Hyp-Ala-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 21)
S2P2b   Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 22)
S2P3a   Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO: 23)
S2P4    Ser-Hyp-Hyp-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Leu-Pro-His (SEQ ID NO: 24)
S1P4    Ser-Hyp-Leu-Pro-Thr-Leu-Ser-Hyp-Leu-Pro-Ala/Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NOS: 25 and 26)
Consensus: (SEQ ID NOS: 27 and 28)

Edman degradation showed that these chymotryptides were closely related to each other, to the partial sequence of the large pronase peptide (Table 3), and to the major pronase peptide of GAGP isolated earlier by Delonnay (see above). Indeed, all can be related to a single 19-amino acid residue consensus sequence with minor variation in some positions (Table 3). These peptides also reflect the overall amino acid composition and are therefore evidence of a highly repetitive polypeptide backbone with minor variations in the repetitive motif; these include occasional substitution of Leu for Hyp and Ser. Remarkably, fifteen residues of the consensus sequence are “quasi-palindromic,” i.e., the side chain sequence is almost the same whether read from the N-terminus or C-terminus.

Example 2 Construction of Synthetic HRGP Gene Cassettes

Synthetic gene cassettes encoding contiguous and noncontiguous Hyp modules are constructed using partially overlapping sets consisting of oligonucleotide pairs, “internal repeat pairs” and “external 3′- and 5′-linker pairs” respectively, all with complementary “sticky” ends. The design strategy for the repetitive HRGP modules combines proven approaches described earlier for the production in E. coli of novel repetitive polypeptide polymers (McGrath et al. [1990] Biotechnol. Prog. 6:188), of a repetitious synthetic analog of the bioadhesive precursor protein of the mussel Mytilus edulis, of a repetitive spider silk protein (Lewis et al. [1996] Protein Express. Purif. 7:400), and of a highly repetitive elastin-like polymer in tobacco [Zhang, X., Urry, D. W., and Daniell, H. “Expression of an environmentally friendly synthetic protein-based polymer gene in transgenic tobacco plants,” Plant Cell Reports, 16: 174 (1996)].

The basic design strategy for synthetic HRGP gene cassettes is illustrated by the following constructs.

a) Ser-Hyp₄ (SEQ ID NO:3) Gene Cassette

A synthetic gene encoding the extensin-like Ser-Hyp₄ (SEQ ID NO:3) module is constructed using the following partially overlapping sets of oligonucleotide pairs.

5′-Linker:
Amino Acid (SEQ ID NO: 29): A G S S T R A S P (P P P)
5′-GCT GGA TCC TCA ACC CGG GCC TCA CCA (SEQ ID NO: 30)
   CGA CCT AGG AGT TGG GCC CCG AGT GGT GGT GGT GGA-5′ (SEQ ID NO: 31)

3′ Linker (for pBI121-Sig-EGFP):
Amino Acid (SEQ ID NO: 32): P P P S P V A R N S P P
5′-CCA CCA CCT TCA CCG GTC GCC CGG AAT TCA CCA CCC (SEQ ID NO: 33)
   AGT GGC CAG CGG GCC TTA AGT GGT GGG-5′ (SEQ ID NO: 34)

3′ Linker (for pBI121-Sig):
Amino Acid:
5′-CCA CCA CCT TAA TAG AGC TCC CCC (SEQ ID NO: 35)
   ATT ATC TCG AGG GGG-5′ (SEQ ID NO: 36)

Internal Repeat
Amino Acid (SEQ ID NO: 37): P P P S P P P P S P
5′-CCA CCA CCT TCA CCT CCA CCC CCA TCT CCA (SEQ ID NO: 38)
   AGT GGA GGT GGG GGT AGA GGT GGT GGT GGA-5′ (SEQ ID NO: 39)

Conversion of the “internal” and 5′ & 3′ “external” gene cassettes to long duplex DNA is accomplished using the following steps:

1. Heat each pair of complementary oligonucleotides to 90° and then anneal by cooling slowly to 60°, thereby forming short duplex internal and external DNAs.
2. Combine the 5′ external linker duplex with the internal repeat duplexes in an approximately 1:20 molar ratio and anneal by further cooling to yield long duplex DNA capped by the 5′ linker. The 5′ linker is covalently joined to the internal repeat duplex by ligation using T4 DNA ligase. (Preferably up to 50, more preferably up to 30, repeats of the internal repeat duplex can be used.)
3. In molar excess, combine the 3′ external linker duplex with the above 5′ linker-internal repeat duplex, anneal and ligate as described above.
4. Digest the 5′ linker-internal repeat-3′ linker duplex with BamHI (cuts within the 5′-linker) and EcoRI (cuts within the 3′-linker).
5. Size fractionate the reaction products using Sephacryl gel permeation chromatography to select constructs greater than 90 bp.
6. Insert the sized, digested synthetic gene cassette into a plasmid having a polylinker containing BamHI and EcoRI sites (e.g., pBluescript SK⁺ or KS⁺ [Stratagene]).
7. Transform E. coli cells (e.g., by electroporation or the use of competent cells) with the plasmid into which the synthetic gene construct has been ligated.
8. Following E. coli transformation, the internal repeat oligonucleotides are used to screen and identify Ampicillin-resistant colonies carrying the synthetic gene construct.
9. The inserts contained on the plasmids within the Ampicillin-resistant colonies are sequenced to confirm the fidelity of the synthetic gene construct.
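The annealing in steps 1 and 2 depends on each internal repeat duplex carrying complementary single-stranded overhangs. The sketch below checks this for the Ser-Hyp₄ internal repeat pair (SEQ ID NOs:38 and 39, entered as printed above, with the lower strand read 3′ to 5′). The function names are hypothetical, and the check is plain string comparison; it is offered only to illustrate how such a pair can tile head to tail.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

TOP_5TO3 = "CCACCACCTTCACCTCCACCCCCATCTCCA"     # SEQ ID NO:38, written 5'->3'
BOTTOM_3TO5 = "AGTGGAGGTGGGGGTAGAGGTGGTGGTGGA"  # SEQ ID NO:39, as printed (3'->5')


def duplex_offset(top, bottom_3to5):
    """Return how many top-strand bases precede the first paired base, or None.

    The bottom strand is given 3'->5', so it can be compared position by
    position with the complement of the top strand.
    """
    comp = top.translate(COMPLEMENT)
    for off in range(len(top)):
        paired = len(top) - off
        if comp[off:] == bottom_3to5[:paired]:
            return off
    return None


def overhangs_compatible(top, bottom_3to5):
    """True if the two single-stranded overhangs can pair with each other."""
    off = duplex_offset(top, bottom_3to5)
    if off is None or off == 0:
        return False
    top_overhang = top[:off]                            # unpaired 5' end of top
    bottom_overhang = bottom_3to5[len(top) - off:]      # unpaired 5' end of bottom
    return top_overhang.translate(COMPLEMENT) == bottom_overhang


if __name__ == "__main__":
    print(duplex_offset(TOP_5TO3, BOTTOM_3TO5))
    print(overhangs_compatible(TOP_5TO3, BOTTOM_3TO5))
```

A compatible pair means the overhang of one duplex can base-pair with the overhang of the next, which is what allows ligation into long repeats.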

b) GAGP (SEQ ID NO:15) Consensus Sequence Cassette

A synthetic gene cassette encoding the GAGP consensus sequence is generated as described above using the following 5′ linker, internal repeat and 3′ linker duplexes.

5′-Linker
Amino Acid (SEQ ID NO: 40): A A G S S T R A (S P S)
5′-GCT GCC GGA TCC TCA ACC CGG GCC-3′ (SEQ ID NO: 41)
3′-CGA CGG CCT AGG AGT TGG GCC CGG AGT GGC AGT-5′ (SEQ ID NO: 42)

3′-Linker (for pBI121-Sig-EGFP)
Amino Acid (SEQ ID NO: 43): S P S P V A R N S P P
5′-TCA CCC TCA CCG GTC GCC CGG AAT TCA CCA CCC-3′ (SEQ ID NO: 44)
3′-GGC CAG CGG GCC TTA AGT GGT GGG-5′ (SEQ ID NO: 45)

3′-Linker (for pBI121-Sig)
Amino Acid:
5′-TCA CCC TCA TAA TAG AGC TCC CCC-3′ (SEQ ID NO: 46)
3′-ATT ATC TCG AGG GGG-5′ (SEQ ID NO: 47)

Internal Repeat
Amino Acid (SEQ ID NO: 48): S P S P T P T P P P G P H S P P P T L
5′-TCA CCC TCA CCA ACT CCT ACC CCA CCA CCT GGT CCA CAC TCA CCA CCA CCA ACA TTG-3′ (SEQ ID NO: 49)
3′-GGT TGA GGA TGG GGT GGT GGA CCA GGT GTG AGT GGT GGT GGT TGT AAC AGT GGG AGT-5′ (SEQ ID NO: 50)

Conversion of the “internal” AGP-like motif and 5′ & 3′ “external” gene cassettes to long duplex DNA is accomplished using the steps described in section a) above. Up to fifty (50) repeats of the internal repeat duplex are desirable (more preferably up to thirty (30) repeats, and most preferably approximately twenty (20) repeats) (i.e., the wild-type protein contains 20 of these repeats).
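As a check on the cassette design, the upper strand of the GAGP internal repeat (SEQ ID NO:49) can be translated with the standard genetic code and compared with the amino acid line given for it (SEQ ID NO:48). The sketch below is illustrative; `translate` and `CODON_TABLE` are hypothetical names. Proline is what the DNA encodes, and conversion to Hyp occurs post-translationally in the plant host.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Upper strand of the GAGP internal repeat, SEQ ID NO:49.
GAGP_REPEAT = ("TCA CCC TCA CCA ACT CCT ACC CCA CCA CCT "
               "GGT CCA CAC TCA CCA CCA CCA ACA TTG")


def translate(dna: str) -> str:
    """Translate a 5'->3' coding strand; spaces are ignored, partial codons dropped."""
    dna = dna.replace(" ", "").upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    one_repeat = translate(GAGP_REPEAT)
    print(one_repeat)
    # Twenty tandem repeats, the number described for the wild-type protein.
    print(len(translate(GAGP_REPEAT.replace(" ", "") * 20)))
```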

Since the above GAGP internal repeat is a consensus sequence, it is also desirable to have repeats that comprise a repeat sequence that varies from the consensus sequence (see, e.g., Table 3 above). In this regard, the variant sequences are likely to be glycosylated in a slightly different manner, which may confer different properties (e.g., more soluble, etc.). Other constructs are shown for other illustrative modules in Table 4.

Example 3 Isolation of Tomato P1 Extensin cDNA Clones

In order to obtain the tomato P1 extensin signal sequence (i.e., signal peptide), P1 extensin cDNA clones were isolated using oligonucleotides designed after the P1-unique protein sequence (SEQ ID NO:51): Val-Lys-Pro-Tyr-His-Pro-Thr-Hyp-Val-Tyr-Lys. When present at the N-terminus of a protein sequence, the P1 extensin signal sequence directs the nascent peptide chain to the ER.

Example 4 Construction of One Embodiment of an Expression Vector

pBI121 is an expression vector which permits the high level expression and secretion of inserted genes in plant cells (e.g., tomato, tobacco, members of the family Solanaceae, members of the family Leguminosae, non-graminaceous monocotyledons). pBI121 contains the 35S CaMV promoter, the tobacco (Nicotiana plumbaginifolia) extensin signal sequence, an EGFP gene, the termination/polyadenylation signal from the nopaline synthetase gene (NOS-ter), a kanamycin-resistance gene (nptII) and the right and left borders of T-DNA to permit transfer into plants by Agrobacterium-mediated transformation.

TABLE 4
ILLUSTRATIVE HRGP SYNTHETIC GENE MODULES

1. MODULES FOR AGP-LIKE SEQUENCES

a. The [SP]_(n) Module
[SP]_(n) Internal Repeat Oligo's:
5′-TCA CCC TCA CCA TCT CCT TCG CCA TCA CCC (SEQ ID NO: 52)
   GGT AGA GGA AGC GGT AGT GGG AGT GGG AGT-5′ (SEQ ID NO: 53)
The [SP]_(n) 3′ & 5′ External Linkers for both plasmids are the same as for the GAGP module.

b. The [AP]_(n) Module
[AP]_(n) Internal Repeat Oligo's:
5′-GCT CCA GCA CCT GCC CCA GCC CCT GCA CCA-3′ (SEQ ID NO: 54)
   GGA CGG GGT CGG GGA CGT GGT-5′ (SEQ ID NO: 55)
[AP]_(n) External Linker Oligo's for plasmid pBI121-Sig-EGFP
5′-Linker:
5′-GCT GCC GGA TCC TCA ACC CGG (SEQ ID NO: 56)
3′-CGA CGG CCT AGG AGT TGG GCC CGA GGT CGT-5′ (SEQ ID NO: 57)
3′-Linker:
5′-GCT CCA GCA CCG GTC GCC CGG AAT TCA CCA CCC-3′ (SEQ ID NO: 58)
3′-GGC CAG CGG GCC TTA AGT GGT GGG-5′ (SEQ ID NO: 59)
[AP]_(n) External 3′ Linker Oligos for plasmid pBI121-Sig
5′-GCT CCA GCA TAA TAG AGC TCC CCC (SEQ ID NO: 60)
   ATT ATC TCG AGG GGG-5′ (SEQ ID NO: 61)

c. The [TP]_(n) Module
[TP]_(n) Internal Repeat Oligo's:
5′-ACA CCA ACC CCT ACT CCC ACG CCA ACA CCT ACA CCC ACT CCA (SEQ ID NO: 62)
   GGA TGA GGG TGC GGT TGT GGA TCT GGG TGA GGT TGT GGT TGG-5′ (SEQ ID NO: 63)
[TP]_(n) External Linker Oligo's for pBI121-Sig-EGFP:
5′-Linker:
5′-GCT GCC GGA TCC TCA ACC CGG (SEQ ID NO: 64)
3′-CGA CGG CCT AGG AGT TGG GCC TGT GGT TGG-5′ (SEQ ID NO: 65)
3′-Linker:
5′-ACA CCA ACC CCG GTC GCC CGG AAT TCA CCA CCC-3′ (SEQ ID NO: 66)
   GGC CAG CGG GCC TTA AGT GGT GGG-5′ (SEQ ID NO: 67)
[TP]_(n) External 3′ Linker Oligos for pBI121-Sig
5′-ACA CCA ACC TAA TAG AGC TCC CCC (SEQ ID NO: 68)
   ATT ATC TCG AGG GGG-5′ (SEQ ID NO: 69)

2. MODULES FOR EXTENSIN-LIKE SEQUENCES

a. The [SPP]_(n) Module
[SPP]_(n) Internal Repeat Oligo's:
5′-CCA CCA TCA CCA CCC TCT CCT CCA TCA CCC CCA TCC CCA CCA TCA (SEQ ID NO: 70)
   GGT GGG AGA GGA GGT AGT GGG GGT AGG GGT GGT AGT GGT GGT AGT-5′ (SEQ ID NO: 71)
[SPP]_(n) External Linkers for pBI121-Sig-EGFP:
5′-Linker:
5′-GCT GCC GGA TCC TCA ACC CGG GCC (SEQ ID NO: 72)
3′-CGA CGG CCT AGG AGT TGG GCC CGG GGT GGT AGT-5′ (SEQ ID NO: 73)
3′-Linker:
5′-CCA CCA TCA CCG GTC GCC CGG AAT TCA CCA CCC-3′ (SEQ ID NO: 74)
   GGC CAG CGG GCC TTA AGT GGT GGG-5′ (SEQ ID NO: 75)
[SPP]_(n) External 3′ Linker for pBI121-Sig:
5′-CCA CCA TCA TAA TAG AGC TCC CCC (SEQ ID NO: 76)
   ATT ATC TCG AGG GGG-5′ (SEQ ID NO: 77)

b. The [SPPP]_(n) Module
[SPPP]_(n) Internal Repeat Oligo's:
5′-CCA CCA CCT TCA CCA CCT CCA TCT CCC CCA CCT TCC CCT CCA CCA TCA (SEQ ID NO: 78)
   AGT GGT GGA GGT AGA GGG GGT GGA AGG GGA GGT GGT AGT GGT GGT GGA-5′ (SEQ ID NO: 79)
[SPPP]_(n) External Linker Oligo's for pBI121-Sig-EGFP:
5′-Linker:
5′-GCT GGA TCC TCA ACC CGG GCC TCA (SEQ ID NO: 80)
3′-CGA CCT AGG AGT TGG GCC CGG AGT GGT GGT GGA-5′ (SEQ ID NO: 81)
3′-Linker:
5′-CCA CCA CCT TCA CCG GTC GCC CGG AAT TCA CCA CCC-3′ (SEQ ID NO: 82)
   AGT GGC CAG CGG GCC TTA AGT GGT GGG-5′ (SEQ ID NO: 83)
[SPPP]_(n) External 3′ Linker Oligos for pBI121-Sig:
5′-CCA CCA CCT TAA TAG AGC TCC CCC (SEQ ID NO: 84)
   ATT ATC TCG AGG GGG-5′ (SEQ ID NO: 85)

d. The P3-Type Extensin Palindromic Module:
P3-Type Extensin Palindromic Internal Repeat Oligo's:
5′-CCA CCA CCT TCA CCC TCT CCA CCT CCA CCA TCT CCG TCA CCA (SEQ ID NO: 86)
   AGT GGG AGA GGT GGA GGT GGT AGA GGC AGT GGT GGT GGT GGA-5′ (SEQ ID NO: 87)
P3-Type Extensin Palindromic External Linker Oligo's: Use the [SPPP]_(n) linkers (SEE ABOVE)

e. The Potato Lectin HRGP Palindromic Module:
Potato Lectin HRGP Palindromic Internal Repeat Oligo's:
5′-CCA CCA CCT TCA CCC CCA TCT CCA CCT CCA CCA TCT CCA CCG TCA CCA (SEQ ID NO: 88)
   AGT GGG GGT AGA GGT GGA GGT GGT AGA GGT GGC AGT GGT GGT GGT GGA-5′ (SEQ ID NO: 89)
Potato Lectin HRGP Palindromic External Linker Oligo's: Use the [SPPP]_(n) linkers (SEE ABOVE)

f. P1-Extensin-Like Modules:

i. The SPPPPTPVYK Module:
SPPPPTPVYK Internal Repeat Oligo's:
5′-CCA CCA CCT ACT CCC GTT TAC AAA TCA CCA CCA CCA CCT ACT CCC GTT TAC AAA TCA CCA (SEQ ID NO: 90)
   TGA GGG CAA ATG TTT AGT GGT GGT GGT GGA TCA GGG CAA ATG TTT AGT GGT GGT GGT GGA-5′ (SEQ ID NO: 91)
SPPPPTPVYK External Linker Oligo's: Use the [SPPP]_(n) linkers (SEE ABOVE)

ii. The SPPPPVKPYHPTPVFL Module:
SPPPPVKPYHPTPVFL Internal Repeat Oligo's:
5′-CCA CCA CCT GTC AAG CCT TAC CAC CCC ACT CCC GTT TTT CTT TCA CCA (SEQ ID NO: 92)
   CAG TTC GGA ATG GTG GGG TGA GGG CAA AAA GAA AGT GGT GGT GGT GGA-5′ (SEQ ID NO: 93)
SPPPPVKPYHPTPVFL External Linker Oligo's: Use the [SPPP]_(n) linkers (SEE ABOVE)

iii. The SPPPPVLPFHPTPVYK Module:
SPPPPVLPFHPTPVYK Internal Repeat Oligo's:
5′-CCA CCA CCT GTC TTA CCT TTC CAC CCC ACT CCC GTT TAC AAA TCA CCA (SEQ ID NO: 94)
   CAG AAT GGA AAG GTG GGG TGA GGG CAA ATG TTT AGT GGT GGT GGT GGA-5′ (SEQ ID NO: 95)
SPPPPVLPFHPTPVYK External Linker Oligo's: Use the [SPPP]_(n) linkers (SEE ABOVE)

EGFP 3′ Linker Oligo's needed to insert EGFP into pBI121-Sig-EGF
5′-GGC CGC GAG CTC CAG CAC GGG (SEQ ID NO: 96)
   CG CTC GAG GTC GTG CCC-5′ (SEQ ID NO: 97)

The extensin signal sequence is present at the N-terminus of proteins encoded by genes inserted into the pBI121 expression vector (e.g., HRGPs encoded by synthetic gene constructs). The tobacco signal sequence was demonstrated to target extensin fusion proteins through the ER and Golgi for posttranslational modifications, and finally to the wall. The targeted expression of recombinant HRGPs is not dependent upon the use of the tobacco extensin signal sequence. Signal sequences involved in the transport of extensins and extensin modules in the same plant family (Solanaceae) as tobacco may be employed; alternatively, the signal sequence from tomato P1 extensin may be employed.

The EGFP gene encodes a green fluorescent protein (GFP) appropriately red-shifted for plant use (the EGFP gene encodes an S65T variant optimized for use in plants and is available from Clontech). Other suitable mutants may be employed (see Table 1). These modified GFPs allow the detection of fewer than 700 GFP molecules at the cell surface. The use of a GFP gene provides a reporter gene and permits the formation of fusion proteins comprising repetitive HRGP modules. GFPs require aerobic conditions for oxidative formation of the fluorophore. GFP is functional at the lower temperatures used for plant cell cultures, and normally it does not adversely affect protein function.

Plasmids pBI121-Sig and pBI121-Sig-EGFP are constructed as follows. For both plasmids, the GUS gene present in pBI121 (Clontech) is deleted by digestion with BamHI and SstI and a pair of partially complementary oligonucleotides encoding the tobacco extensin signal sequence is annealed to the BamHI and SstI ends. The oligonucleotides encoding the 21 amino acid extensin signal sequence have the following sequence: 5′-GA TCC GCA ATG GGA AAA ATG GCT TCT CTA TTT GCC ACA TTT TTA GTG GTT TTA GTG TCA CTT AGC TTA GCA CAA ACA ACC CGG GTA CCG GTC GCC ACC ATG GTG TAA AGC GGC CGC GAG CT-3′ (SEQ ID NO:98) and 5′-C GCG GCC GCT TTA CAC CAT GGT GGC GAC CGG TAC CCG GGT TGT TTG TGC TAA GCT AAG TGA CAC TAA AAC CAC TAA AAA TGT GGC AAA TAG AGA AGC CAT TTT TCC CAT TGC G-3′ (SEQ ID NO:99). In addition to encoding the extensin signal sequence, this pair of oligonucleotides, when inserted into the digested pBI121 vector, provides a BamHI site (5′ end) and XmaI and SstI sites (3′ end). The XmaI and SstI sites allow the insertion of the GFP gene. The modified pBI121 vector lacking the GUS gene and containing the synthetic extensin signal sequence is termed pBI121-Sig. Proper construction of pBI121-Sig is confirmed by DNA sequencing.
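The pairing of the two signal sequence oligonucleotides can be checked with a short Python sketch. This is an illustration only and not part of the described procedure; the helper name `reverse_complement` is an assumption introduced here. It tests whether SEQ ID NO:99 is the reverse complement of SEQ ID NO:98 with four bases removed from each end, which would leave a 5′ GATC overhang (compatible with BamHI-cut ends) and a 3′ AGCT overhang (compatible with SstI-cut ends) on the annealed duplex.

```python
# Sketch: check how the two signal-sequence oligonucleotides anneal.
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))

# Sequences copied from the text (5' to 3'); spaces are removed before use.
SEQ_ID_98 = (
    "GA TCC GCA ATG GGA AAA ATG GCT TCT CTA TTT GCC ACA TTT TTA GTG GTT "
    "TTA GTG TCA CTT AGC TTA GCA CAA ACA ACC CGG GTA CCG GTC GCC ACC ATG GTG "
    "TAA AGC GGC CGC GAG CT"
).replace(" ", "")
SEQ_ID_99 = (
    "C GCG GCC GCT TTA CAC CAT GGT GGC GAC CGG TAC CCG GGT TGT TTG TGC TAA "
    "GCT AAG TGA CAC TAA AAC CAC TAA AAA TGT GGC AAA TAG AGA AGC CAT TTT TCC "
    "CAT TGC G"
).replace(" ", "")

# Region of SEQ ID NO:98 that base-pairs with SEQ ID NO:99.
paired_region = reverse_complement(SEQ_ID_99)
five_prime_overhang = SEQ_ID_98[:4]    # single-stranded 5' end of the duplex
three_prime_overhang = SEQ_ID_98[-4:]  # single-stranded 3' end of the duplex

print("paired region matches:", paired_region == SEQ_ID_98[4:-4])
print("5' overhang:", five_prime_overhang)
print("3' overhang:", three_prime_overhang)
```

The check only concerns base pairing; it says nothing about annealing efficiency or ligation yield.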

The GFP gene (e.g., the EGFP gene) is inserted into pBI121-Sig to make pBI121-Sig-EGFP as follows. The EGFP gene is excised from pEGFP (Clontech) as a 1.48 kb XmaI/NotI fragment (base pairs 270 to 1010 in pEGFP). This 1.48 kb XmaI/NotI fragment is then annealed and ligated to a synthetic 3′ linker (see above). The EGFP-3′ linker is then digested with SstI to produce an XmaI/SstI EGFP fragment which is inserted into the XmaI/SstI site of pBI121-Sig to create pBI121-Sig-EGFP. The AgeI (discussed below), XmaI and SstI sites provide unique restriction enzyme sites. Proper construction of the plasmids is confirmed by DNA sequencing.

The EGFP sequences in pBI121-Sig-EGFP contain an AgeI site directly before the translation start codon (i.e., ATG) of EGFP. Synthetic HRGP gene cassettes are inserted into the plasmid between the signal sequence and the EGFP gene sequences as XmaI/AgeI fragments; the HRGP gene cassettes are excised as XmaI/AgeI fragments from the pBluescript constructs described in Ex. 2. Proper construction of HRGP-containing expression vectors is confirmed by DNA sequencing and/or restriction enzyme digestion.

Expression of the synthetic HRGP gene cassettes is not dependent upon the use of the pBI121-Sig and pBI121-Sig-EGFP vectors. Analogous expression vectors containing other promoter elements functional in plant cells may be employed (e.g., the CaMV region IV promoter, ribulose-1,5-bisphosphate (RUBP) carboxylase small subunit (ssu) promoter, the nopaline promoter, octopine promoter, mannopine promoter, the β-conglycinin promoter, the ADH promoter, heat shock promoters, tissue-specific promoters, e.g., promoters associated with fruit ripening, and promoters regulated during seed ripening (e.g., promoters from the napin, phaseolin and glycinin genes)). For example, expression vectors containing a promoter that directs high level expression of inserted gene sequences in the seeds of plants (e.g., fruits, legumes and cereals, including but not limited to corn, wheat, rice, tomato, potato, yam, pepper, squash, cucumbers, beans, peas, apple, cherry, peach, black locust, pine and maple trees) may be employed. Expression may also be carried out in green algae.

In addition, alternative reporter genes may be employed in place of the GFP gene. Suitable reporter genes include β-glucuronidase (GUS), neomycin phosphotransferase II gene (nptII), alkaline phosphatase, luciferase, and CAT (chloramphenicol acetyltransferase). Preferred reporter genes lack Hyp residues. Further, the proteins encoded by the synthetic HRGP genes need not be expressed as fusion proteins. This is readily accomplished using the pBI121-Sig vector.

Example 5 Expression of Recombinant HRGPs in Tomato Cell Suspension Cultures

The present invention contemplates that recombinant HRGPs encoded by expression vectors comprising synthetic HRGP gene modules are expressed in tomato cell suspension cultures. The expression of recombinant HRGPs in tomato cell suspension cultures is illustrated by the discussion provided below for recombinant GAGP expression.

a) Expression of Recombinant GAGP

An expression vector containing the synthetic GAGP gene cassette (capable of being expressed as a fusion with GFP or without GFP sequences) is introduced into tomato cell suspension cultures. A variety of means are known to the art for the transfer of DNA into tomato cell suspension cultures, including Agrobacterium-mediated transfer and biolistic transformation.

Agrobacterium-mediated transformation: The present invention contemplates transforming both suspension cultured cells (Bonnie Best cultures) and tomato leaf discs by mobilizing the above-described plasmid constructions (and others) from E. coli into Agrobacterium tumefaciens strain LBA4404 via triparental mating. Positive colonies are used to infect tomato (Lycopersicon esculentum) cultures or leaf discs. Transformed cells/plants are selected on MSO medium containing 500 mg/mL carbenicillin and 100 mg/mL kanamycin. Expression of GFP fusion products is conveniently monitored by fluorescence microscopy using a high Q FITC filter set (Chroma Technology Corp.). FITC conjugates (e.g. FITC-BSA) can be used along with purified recombinant GFP as controls for microscopy set-up. Cultured tomato cells show only very weak autofluorescence. Thus, one can readily verify the spatiotemporal expression of GFP-Hyp module fusion products.

Transgenic cells/plants can be examined for transgene copy number and construct fidelity by genomic Southern blotting and for the HRGP construct mRNA by northern blotting, using the internal repeat oligonucleotides as probes. Controls include tissue/plants which are untransformed, transformed with pBI121 alone, with pBI121 containing only GFP, and with pBI121 having the signal sequence and GFP but no HRGP synthetic gene.

Microprojectile bombardment: 1.6 μm gold particles are coated with each appropriate plasmid construct DNA for use in a Biolistic particle delivery system to transform the tomato suspension cultures/callus or other tissue. Controls include: particles without DNA, particles which contain pBI121 only, and particles which contain pBI121 and GFP.

b) Expression of Other HRGPs of Interest

As noted above, the present invention contemplates expressing a variety of HRGPs, fragments and variants. Such HRGPs include, but are not limited to, RPRPs, extensins, AGPs and other plant gums (e.g. gum Karaya, gum Tragacanth, gum Ghatti, etc.). HRGP chimeras include but are not limited to HRGP plant lectins, including the solanaceous lectins, plant chitinases, and proteins in which the HRGP portion serves as a spacer (such as in sunflower). The present invention specifically contemplates using the HRGP modules (described above) as spacers to link non-HRGP proteins (e.g. enzymes) together.

Example 6 Construction of a Synthetic HRGP Gene Cassette Incorporating a GAGP Construct

Synthetic gene cassettes encoding contiguous and noncontiguous Hyp modules were constructed using partially overlapping sets consisting of oligonucleotide pairs, “internal repeat pairs” and “external 3′- and 5′-linker pairs” respectively, all with complementary “sticky” ends. The following 5′-linker, internal repeat and 3′-linker duplexes were employed:

5′-Linker     A   A   G   S   S   T   R   A  (S   P   S) (SEQ ID NO: 40)5′-GCT GCC GGA TCC TCA ACC CGG GCC-3′ (SEQ ID NO: 41) 3′-CGA CGG CCT AGGAGT TGG GCC CGG AGT GGG AGT-5′ (SEQ ID NO: 42) 3′-Linker    S   P   S   P   V   A   R   N   S   P   P (SEQ ID NO: 43) 5′-TCA CCCTCA CCG GTC GCC CGG AAT TCA CCA CCC-3′ (SEQ ID NO: 44)            3′-GGC CAG CGG GCC TTA AGT GGT GGG-5′ (SEQ ID NO: 45)Internal Repeat      S   P   S   P   T   P   T    A   P   P   G   P   H   S   P   P   P (SEQ ID NO: 100)  T   L [5′-TCA CCCTCA CCA ACT CCT ACC GCA  CCA CCT GGT CCA CAC TCT CCA CCA CCA (SEQ ID NO:101) ACA TTG-3′] [3′-AGT GGG AGT GGT TGA GGA TGG CGT  GGT GGA CCA GGTGTG AGA GGT GGT GGT (SEQ ID NO: 102) TGT AAC-5′]₂ then:    S   P   S   P   T   P   T   A   P   P   G   P   H   S   P   P   P(SEQ ID NO: 103)  

   L 5′-TCA CCC TCA CCA ACT CCT ACC GCA CCA CCT GGT CCA CAC TCT CCA CCACCA (SEQ ID NO: 104)

 TTG-3′ 3′-AGT GGG AGT GGT TGA GGA TGG CGT GGT GGA CCA GGT GTG AGA GGTGGT GGT (SEQ ID NO: 105)

 AAC-5′

The following synthetic gene (SEQ ID NO:106) was eventually expressed in tobacco and tomato cell cultures and tobacco plants using the above constructs:

                M   G   K   M   A   S   L   F   A   T   F   L   V   V   L   V5′-GGA TCC GCA ATG GGA AAA ATG GCT TCT CTA TTT GCC ACA TTT TTA GTC GTTTTA GTG 3′-CCT AGG CGT TAC CCT TTT TAC CGA AGA GAT AAA CGG TGT AAA AATCAC CAA AAT CAC S   L   S   L   A   Q   T   T   R   D   S   P   S   P   T   P   T   A   PTCA CTT AGC TTA GCA CAA ACA ACC CGG GAC TCA CCC TCA CCA ACT CCT ACCGCA CCA AGT GAA TCG AAT CGT GTT TGT TGG GCC CTG AGT GGG AGT GGT TGA GGATGG CGT GGT P   G   P   H   S   P   P   P   T   L   S   P   S   P   T   P   T   A   PCCT GGT CCA CAC TCT CCA CCA CCA ACA TTG TCA CCC TCA CCA ACT CCT ACC GCA CCA GGA CCA GGT GTG AGA GGT GGT GGT TGT AAC AGT GGG AGT GGT TGA GGA TGGCGT  GGT P   G   P   H   S   P   P   P   T   L   S   P   S   P   T   P   T   A   PCCT GGT CCA CAC TCA CCA CCA CCA ACA TTG TCA CCC TCA CCA ACT CCT ACC GCA CCA GGA CCA GGT GTG AGT GGT GGT GGT TGT AAC AGT GGG AGT GGT TGA GGA TGGCGT  GGT  P   G   P   H   S   P   P   P   S   L   

   P   S   P   V CCT GGT CCA CAC TCA CCA CCA CCA TCA TTG

 CCC TCA CCG GTC GCC ACC-gfp-3′ GGA CCA GGT GTG AGT GGT GGT GGT AGT AAC

 GGG AGT GGC CAG CGG TGG-gfp-5′
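The correspondence between the sense oligonucleotides and the peptide sequences printed above them can be verified with a small translation sketch in Python. This is an illustrative check rather than part of the original method: the `translate` helper and the compact codon table are assumptions introduced here, and the standard genetic code is assumed. It translates the GAGP internal repeat sense strand (SEQ ID NO: 101) and the 3′-linker sense strand (SEQ ID NO: 44).

```python
# Sketch: translate sense-strand oligonucleotides with the standard genetic code.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna: str) -> str:
    """Translate a 5'->3' sense strand, reading from its first base."""
    dna = dna.replace(" ", "")
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# Sense strands copied from the duplex listing above.
GAGP_INTERNAL_REPEAT = (
    "TCA CCC TCA CCA ACT CCT ACC GCA CCA CCT GGT CCA CAC TCT CCA CCA CCA ACA TTG"
)
LINKER_3_SENSE = "TCA CCC TCA CCG GTC GCC CGG AAT TCA CCA CCC"

repeat_peptide = translate(GAGP_INTERNAL_REPEAT)
linker_peptide = translate(LINKER_3_SENSE)
print(repeat_peptide)
print(linker_peptide)
```

Proline residues in the output are those that may be hydroxylated in planta; the translation itself cannot predict which ones are.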

This example involved: (A) Oligonucleotide pair preparation; (B) Oligonucleotide polymerization; (C) Construct precipitation; (D) Restriction of gene 3′-linker and 5′-linker capped ends; (E) Size-fractionation and removal of enzyme contaminants; (F) Gene insertion into SK plasmid vector. All SDS-PAGE purified oligonucleotides were synthesized by Gibco-BRL.

(A) Oligonucleotide Pair Preparation

In separate Eppendorf tubes were combined:

Tube 1) 5.5 μl GAGP internal repeat sense oligonucleotide (0.5 nmol/μl), 5.5 μl GAGP internal repeat antisense oligonucleotide (0.5 nmol/μl), 11 μl T4 ligase 10× ligation buffer (New England Biolabs);

Tube 2) 2 μl 5′-sense linker (0.05 nmol/μl), 2 μl 5′-antisense linker (0.05 nmol/μl), 1 μl H₂O, 5 μl T4 ligase 10× ligation buffer (New England Biolabs);

Tube 3) 2 μl 3′-sense linker (1 nmol/μl), 2 μl 3′-antisense linker (1 nmol/μl), 1 μl water, 5 μl T4 ligase 10× ligation buffer (New England Biolabs).

All tubes were heated to 90-95° C. for 5 minutes, then slowly cooled over the next 3 hours to 45° C. The tubes were then incubated at 45° C. for 2 hours.

(B) Oligonucleotide Polymerization

10 μl of solution from Tube 1 (internal repeat pair) was combined with 10 μl of solution from Tube 2 (5′ linker pair), and incubated at 17° C. for 3 hours. To this mixture were added 80 μl water and 2 μl (4000 U) T4 DNA ligase (New England Biolabs), and the mixture was again incubated at 12-15° C. for 36 hours. The degree of polymerization was verified on a 2.2% agarose gel (Fisher).
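The molar excess of internal repeat duplex over 5′-linker duplex in this mixture follows from the tube compositions in step (A). The Python sketch below works through the arithmetic; it assumes complete annealing of equimolar strands in each tube, and the function name `nmol_in_aliquot` is introduced here for illustration only.

```python
# Sketch: duplex amounts carried into the polymerization mixture.
def nmol_in_aliquot(total_nmol: float, total_ul: float, aliquot_ul: float) -> float:
    """Amount of duplex (nmol) in an aliquot of a well-mixed tube."""
    return total_nmol * aliquot_ul / total_ul

# Tube 1: 5.5 ul of each strand at 0.5 nmol/ul plus 11 ul buffer.
tube1_ul = 5.5 + 5.5 + 11
tube1_duplex_nmol = 5.5 * 0.5  # limited by either strand (equimolar)

# Tube 2: 2 ul of each strand at 0.05 nmol/ul, 1 ul water, 5 ul buffer.
tube2_ul = 2 + 2 + 1 + 5
tube2_duplex_nmol = 2 * 0.05

repeat_nmol = nmol_in_aliquot(tube1_duplex_nmol, tube1_ul, 10)
linker_nmol = nmol_in_aliquot(tube2_duplex_nmol, tube2_ul, 10)
molar_ratio = repeat_nmol / linker_nmol

print(f"internal repeat duplex: {repeat_nmol:.2f} nmol")
print(f"5' linker duplex:       {linker_nmol:.2f} nmol")
print(f"repeat : linker ratio   {molar_ratio:.1f} : 1")
```

Under these assumptions the repeat duplex is in roughly twelve-fold molar excess over the 5′-linker duplex, which favors chains containing several internal repeats per linker.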

The 3′-end of the polymer was then capped by adding 50 μl of the ligated GAGP 5′-linker mixture from above to 5 μl of solution from Tube 3 (3′-linker), heating to 30° C., and incubating at 17° C. for 3 hours. 20 μl water and 2 μl T4 DNA ligase (New England Biolabs) were then added, and the solution was incubated at 12-15° C. for 36 hr. Finally, the solution was heated at 65° C. for 10 minutes to denature the ligase.

(C) Construct Precipitation

10 μl GAGP construct from (B) above was combined with 25 μl water and 5 μl 3 M NaAcetate. 150 μl EtOH was then added and the solution incubated at 4° C. for 30 minutes. The solution was then centrifuged at 10,000 rpm for 30 minutes. The resultant pellet was washed with 70% EtOH and dried.

(D) Restriction of Gene 3′-Linker and 5′-Linker Capped Ends

The pellet from (C) above was dissolved in 14 μl water. 2 μl 10× EcoRI restriction buffer (New England Biolabs), 2 μl EcoRI 10 U/μl (New England Biolabs), and 2 μl BamHI 20 U/μl (New England Biolabs) were then added and the mixture incubated at 37° C. overnight.

(E) Size-Fractionation and Removal of Enzyme Contaminants

10 μl water was added to 20 μl of the restricted genes from Step (D) above. This mixture was then loaded onto a Sephacryl S-400 (Pharmacia Microspin™) minicolumn and spun to remove small (<90 bp) oligonucleotide fragments. The first effluent from the column (i.e. the high molecular weight material) was collected. Finally, the enzymes were removed using a QIAquick Nucleotide Removal Kit (Qiagen). The final volume of mixture was approximately 50 μl.
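To see which ligation products the <90 bp cutoff would be expected to remove, the sense-strand length of a capped construct can be estimated from the oligonucleotide lengths. The Python sketch below is a rough illustration only: it counts the full-length sense strand before BamHI/EcoRI trimming, treats the column cutoff as a sharp threshold, and the function names are assumptions introduced here.

```python
# Sketch: sense-strand length of a 5'-linker + n internal repeats + 3'-linker chain.
LINKER_5_SENSE = "GCT GCC GGA TCC TCA ACC CGG GCC".replace(" ", "")
GAGP_REPEAT_SENSE = (
    "TCA CCC TCA CCA ACT CCT ACC GCA CCA CCT GGT CCA CAC TCT CCA CCA CCA ACA TTG"
).replace(" ", "")
LINKER_3_SENSE = "TCA CCC TCA CCG GTC GCC CGG AAT TCA CCA CCC".replace(" ", "")

def sense_length(n_repeats: int) -> int:
    """Length in nucleotides of the uncut sense strand with n internal repeats."""
    return len(LINKER_5_SENSE) + n_repeats * len(GAGP_REPEAT_SENSE) + len(LINKER_3_SENSE)

def retained(n_repeats: int, cutoff_bp: int = 90) -> bool:
    """True if the construct is at least as long as the nominal column cutoff."""
    return sense_length(n_repeats) >= cutoff_bp

for n in range(5):
    print(n, sense_length(n), "retained" if retained(n) else "removed")
```

On this estimate a linker-only product (no internal repeat) falls below the cutoff, while constructs with one or more internal repeats are retained.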

(F) Gene Insertion into SK Plasmid Vector

SK plasmid vector (Stratagene) was restricted with BamHI and EcoRI and restricted large plasmid fragments were isolated from agarose gel. To 2-3 μg restricted SK plasmid in 10 μl water were added 6 μl restricted GAGP gene construct from Step (E), 2 μl T4 DNA ligase buffer (New England Biolabs), and 1 μl T4 DNA ligase (New England Biolabs). The solution was then kept at 8° C. overnight for ligation. 100 μl competent XL1-Blue cells (Stratagene) were then transformed with 3 μl ligation mixture. Clones were selected via Blue/White assay (Promega Corporation), as described by Promega Protocols and Applications Guide, 2 ed. (1991), by hybridization with ³²P-labeled antisense internal oligonucleotide, and by restriction mapping.

Example 7 Construction of a Synthetic HRGP Gene Cassette Incorporating an SP Construct

Synthetic gene cassettes encoding contiguous and noncontiguous Hyp modules were constructed using partially overlapping sets consisting of oligonucleotide pairs, “internal repeat pairs” and “external 3′- and 5′-linker pairs” respectively, all with complementary “sticky” ends. The following 5′-linker, internal repeat and 3′-linker duplexes were employed:

5′-Linker
      A   A   G   S   S   T   R   A  (S   P   S)             (SEQ ID NO: 40)
  5′-GCT GCC GGA TCC TCA ACC CGG GCC-3′                       (SEQ ID NO: 41)
  3′-CGA CGG CCT AGG AGT TGG GCC CGG AGT GGG AGT-5′           (SEQ ID NO: 42)

3′-Linker
      S   P   S   P   V   A   R   N   S   P   P               (SEQ ID NO: 43)
  5′-TCA CCC TCA CCG GTC GCC CGG AAT TCA CCA CCC-3′           (SEQ ID NO: 44)
              3′-GGC CAG CGG GCC TTA AGT GGT GGG-5′           (SEQ ID NO: 45)

Internal Repeat
      S   P   S   P   S   P   S   P   S   P  (S   P   S)     (SEQ ID NO: 107)
  5′-TCA CCC TCA CCA TCT CCT TCG CCA TCA CCC                  (SEQ ID NO: 108)
              3′-GGT AGA GGA AGC GGT AGT GGG AGT GGG AGT-5′   (SEQ ID NO: 109)

The following synthetic gene (SEQ ID NO:110) was eventually expressed in tobacco and tomato cell cultures and tobacco plants using the above constructs:

      G   S   A   M   G   K   M   A   S   L   F   A   T   F   L   V   V   L   V
  5′-GGA TCC GCA ATG GGA AAA ATG GCT TCT CTA TTT GCC ACA TTT TTA GTG GTT TTA GTG
  3′-CCT AGG CGT TAC CCT TTT TAC CGA AGA GAT AAA CGG TGT AAA AAT CAC CAA AAT CAC

      S   L   S   L   A   Q   T   T   R   A  [S   P   S   P   S   P   S   P   S
     TCA CTT AGC TTA GCA CAA ACA ACC CGG GCC [TCA CCC TCA CCA TCT CCT TCG CCA TCA
     AGT GAA TCG AAT CGT GTT TGT TGG GCC CGG [AGT GGG AGT GGT AGA GGA AGC GGT AGT

      P ]   S   P   S   P   V   A   T
     CCC]₆ TCA CCC TCA CCG GTC GCC ACC-gfp-3′
     GGG]₆ AGT GGG AGT GGC CAG CGG TGG-gfp-5′
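The complementary "sticky" ends of the SP internal repeat pair can be illustrated with a short Python sketch. It is a hedged illustration, not a description of the authors' design procedure: the 9-nucleotide stagger is read off the duplex listing for SEQ ID NOs: 108 and 109, and the variable names are introduced here. The sketch checks that the two strands pair over 21 nucleotides and that the antisense overhang is complementary to the first 9 nucleotides of another sense strand, which is what allows head-to-tail polymerization.

```python
# Sketch: overhang compatibility of the SP internal repeat duplex.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Sense strand written 5'->3'; antisense written 3'->5' as in the listing.
sense = "TCA CCC TCA CCA TCT CCT TCG CCA TCA CCC".replace(" ", "")
antisense_3to5 = "GGT AGA GGA AGC GGT AGT GGG AGT GGG AGT".replace(" ", "")

STAGGER = 9  # assumed offset (nt) between the strands, taken from the listing

# Sequence (5'->3') that the antisense strand would pair with.
antisense_target = antisense_3to5.translate(COMPLEMENT)

paired_region = sense[STAGGER:]                        # double-stranded part
sense_overhang = sense[:STAGGER]                       # free 5' end of sense strand
antisense_overhang_target = antisense_target[len(paired_region):]

print("paired length:", len(paired_region))
print("strands pair over that region:", antisense_target.startswith(paired_region))
print("sense overhang:", sense_overhang)
print("antisense overhang pairs with:", antisense_overhang_target)
```

Because the antisense overhang pairs with the same 9 nucleotides that begin every sense strand, each duplex can join the next in only one orientation.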

This example involved: (A) Oligonucleotide pair preparation; (B) Oligonucleotide polymerization; (C) Construct precipitation; (D) Restriction of gene 3′-linker and 5′-linker capped ends; (E) Size-fractionation and removal of enzyme contaminants; (F) Gene insertion into SK plasmid vector. All SDS-PAGE purified oligonucleotides were synthesized by Gibco-BRL.

(A) Oligonucleotide Pair Preparation

In separate Eppendorf tubes were combined:

Tube 1) 5.5 μl SP internal repeat sense oligonucleotide (0.5 nmol/μl), 5.5 μl SP internal repeat antisense oligonucleotide (0.5 nmol/μl), 11 μl T4 ligase 10× ligation buffer (New England Biolabs);

Tube 2) 2 μl 5′-sense linker (0.05 nmol/μl), 2 μl 5′-antisense linker (0.05 nmol/μl), 1 μl H₂O, 5 μl T4 ligase 10× ligation buffer (New England Biolabs);

Tube 3) 2 μl 3′-sense linker (1 nmol/μl), 2 μl 3′-antisense linker (1 nmol/μl), 1 μl water, 5 μl T4 ligase 10× ligation buffer (New England Biolabs).

All tubes were heated to 90-95° C. for 5 minutes, then slowly cooled over the next 3 hours to 45° C. The tubes were then incubated at 45° C. for 2 hours.

(B) Oligonucleotide Polymerization

10 μl of solution from Tube 1 (internal repeat pair) was combined with 10 μl of solution from Tube 2 (5′ linker pair), and incubated at 17° C. for 3 hours. To this mixture were added 80 μl water and 2 μl (4000 U) T4 DNA ligase (New England Biolabs), and the mixture was again incubated at 12-15° C. for 36 hours. The degree of polymerization was verified on a 2.2% agarose gel (Fisher).

The 3′ end of the polymer was then capped by adding 50 μl of the ligated SP-5′ linker mixture from above to 5 μl of solution from Tube 3 (3′ linker), heating to 30° C., and incubating at 17° C. for 3 hours. 20 μl water and 2 μl T4 DNA ligase (New England Biolabs) were then added, and the solution was incubated at 12-15° C. for 36 hr. Finally, the solution was heated at 65° C. for 10 minutes to denature the ligase.

(C) Construct Precipitation

10 μl SP construct from (B) above was combined with 25 μl water and 5 μl 3 M NaAcetate. 150 μl EtOH was then added and the solution incubated at 4° C. for 30 minutes. The solution was then centrifuged at 10,000 rpm for 30 minutes. The resultant pellet was washed with 70% EtOH and dried.

(D) Restriction of Gene 3′-Linker and 5′-Linker Capped Ends

The pellet from (C) above was dissolved in 14 μl water. 2 μl 10× EcoRI restriction buffer (New England Biolabs), 2 μl EcoRI 10 U/μl (New England Biolabs), and 2 μl BamHI 20 U/μl (New England Biolabs) were then added and the mixture incubated at 37° C. overnight.

(E) Size-Fractionation and Removal of Enzyme Contaminants

10 μl water was added to 20 μl of the restricted genes from Step (D) above. This mixture was then loaded onto a Sephacryl S-400 (Pharmacia Microspin™) minicolumn and spun to remove small (<90 bp) oligonucleotide fragments. The first effluent from the column (i.e. the high molecular weight material) was collected. Finally, the enzymes were removed using a QIAquick Nucleotide Removal Kit (Qiagen). The final volume of mixture was approximately 50 μl.

(F) Gene Insertion into SK Plasmid Vector

SK plasmid vector (Stratagene) was restricted with BamHI and EcoRI and restricted large plasmid fragments were isolated from agarose gel. To 2-3 μg restricted SK plasmid in 10 μl water were added 6 μl restricted SP gene construct from Step (E), 2 μl T4 DNA ligase buffer (New England Biolabs), and 1 μl T4 DNA ligase (New England Biolabs). The solution was then kept at 8° C. overnight for ligation. 100 μl competent XL1-Blue cells (Stratagene) were then transformed with 3 μl ligation mixture. Clones were selected via Blue/White assay (Promega Corporation), as described by Promega Protocols and Applications Guide, 2 ed. (1991), by hybridization with ³²P-labeled antisense internal oligonucleotide, and by restriction mapping.

Example 8 Gene Subcloning into pEGFP, pKS, pUC18 and pBI121 and Signal Sequence Synthesis

The methods of the following example were used to incorporate the synthetic genes of Examples 6 and 7 into the pBI121 plasmid. Restriction digests, ligations, subclonings, and E. coli transformations were performed generally according to F. M. Ausubel, ed., “Current Protocols in Molecular Biology,” (1995), Chapter 3: Enzymatic Manipulation of DNA and DNA Restriction Mapping, Subcloning of DNA Fragments. The restriction digests used 1-2 μg of plasmid DNA, 5-10 U of restriction enzyme, and 1× recommended restriction buffer (starting with the 10× buffer provided by the company). Samples were run on 1-2.2% agarose gels in TBE buffers. Plasmid and DNA fragments were isolated from gels using QIAEX II gel extraction kits (Qiagen). The DNA ligase employed was 400 U T4 (New England Biolabs). Vector:fragment ratios employed were 1:2-1:6, and ligation volumes were 20 μl.
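For a given vector:fragment ratio, the mass of insert needed scales with the relative lengths of the two DNAs. The Python sketch below shows the usual calculation; it assumes the 1:2-1:6 ratios are molar ratios, and the vector mass and the 3000 bp and 1000 bp sizes are hypothetical values chosen only to make the arithmetic concrete.

```python
# Sketch: insert mass required for a chosen vector:insert molar ratio.
def insert_mass_ng(vector_ng: float, vector_bp: int, insert_bp: int,
                   insert_per_vector: float) -> float:
    """Mass of insert (ng) giving `insert_per_vector` moles of insert per mole of vector."""
    return vector_ng * (insert_bp / vector_bp) * insert_per_vector

# Hypothetical example values (not taken from the text).
VECTOR_NG = 100.0
VECTOR_BP = 3000
INSERT_BP = 1000

for ratio in (2, 4, 6):
    mass = insert_mass_ng(VECTOR_NG, VECTOR_BP, INSERT_BP, ratio)
    print(f"1:{ratio} vector:insert -> {mass:.1f} ng insert")
```

Shorter inserts therefore need proportionally less mass to reach the same molar ratio.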

Transformation of E. coli was done in 5-10 μl ligation reaction volumes with XL1-Blue competent cells (Stratagene). Cells were plated on LB plates containing 50 μg/ml ampicillin or 30 μg/ml kanamycin.

Plasmid isolation was performed by growing transformed XL1-Blue cells in 3 mL LB-ampicillin or LB-kanamycin medium. The plasmids were then isolated using a Wizard Plus Miniprep DNA Purification System (Promega).

This example involved: (A) Insertion of the synthetic gene into pEGFP; (B) Insertion of GAGP-EGFP or SP-EGFP fragment into pKS; (C) Construction of the Signal Sequence and cloning into pUC18; (D) Insertion of GAGP-EGFP or SP-EGFP construct into pUC18; (E) Insertion of SS-GAGP-EGFP or SS-SP-EGFP genes into pBI121.

(A) Insert Synthetic Gene for GAGP or SP into pEGFP

This step was carried out to allow directional cloning of the gene at the 5′ end of EGFP. First, the GAGP or SP gene was isolated from pSK [from Examples 6(F) and 7(F)] as a BamHI (New England Biolabs) and AgeI (New England Biolabs) fragment. The pEGFP (Clontech) was then restricted with BamHI and AgeI. Finally, the BamHI/AgeI-restricted gene was annealed with BamHI/AgeI-restricted pEGFP, and ligated to yield pEGFP containing the synthetic gene inserted at the 5′ end of the EGFP.

(B) Insert GAGP-EGFP or SP-EGFP Fragment into pKS

This step was carried out to obtain an SstI site at the 3′ end of EGFP. The GAGP-EGFP or SP-EGFP construct from (A) above was isolated from pEGFP as an XmaI/NotI fragment. pKS (Stratagene) was then restricted with XmaI and NotI (New England Biolabs). Finally, the GAGP-EGFP or SP-EGFP construct was annealed with cut pKS and ligated to yield pKS containing GAGP-EGFP or SP-EGFP.

(C) Construction of the Signal Sequence and Cloning into pUC18

In order to anneal the partially overlapping sense and antisense oligonucleotides encoding the extensin signal sequence, 2 μl signal sequence sense oligonucleotide (0.1 nmol/μl), 2 μl signal sequence antisense oligonucleotide (0.1 nmol/μl), 2 μl 10× DNA Polymerase Buffer (New England Biolabs), and 14 μl H₂O were combined and heated to 85° C. for 5 minutes. The mixture was then slowly cooled to 40° C. over 1 hour.

The annealed oligonucleotides were then extended via primer extension. To the above mixture were added 2 μl dNTP 2.5 mM (New England Biolabs) and 1 μl DNA Polymerase 5 U/μl (New England Biolabs), and the resultant mixture was incubated at 37° C. for 10 minutes. The polymerase was then denatured by heating at 70° C. for 10 minutes. Then 8 μl Buffer 4 (New England Biolabs), 66 μl H₂O, 2 μl BamHI 20 U/μl (New England Biolabs), and 2 μl SstI 14 U/μl (Sigma) were added and the mixture incubated at 37° C. overnight. The restriction enzymes were then denatured by heating at 70° C. for 10 minutes.
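The working concentrations in the extension and digestion steps follow from simple dilution arithmetic. The Python sketch below tracks the volumes stated above; it assumes volumes are additive and treats "dNTP 2.5 mM" as the stock concentration as stated, without resolving whether that figure is per nucleotide or total. The helper name `diluted` is introduced here for illustration.

```python
# Sketch: dilution arithmetic for the primer-extension and digestion mixtures.
def diluted(stock_conc: float, stock_ul: float, final_ul: float) -> float:
    """Concentration after diluting `stock_ul` of stock into `final_ul` total."""
    return stock_conc * stock_ul / final_ul

annealing_ul = 2 + 2 + 2 + 14               # oligos, 10x buffer, water
extension_ul = annealing_ul + 2 + 1         # plus dNTP and polymerase
digestion_ul = extension_ul + 8 + 66 + 2 + 2  # plus Buffer 4, water, BamHI, SstI

dntp_mM = diluted(2.5, 2, extension_ul)     # dNTP during extension
oligo_nmol_each = 2 * 0.1                   # nmol of each strand present
polymerase_units = 1 * 5                    # U of polymerase added

print(f"extension volume: {extension_ul} ul, dNTP {dntp_mM:.3f} mM")
print(f"digestion volume: {digestion_ul} ul")
```

The same helper can be reused for any other stock dilution in the protocol.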

The mixture was then precipitated with EtOH/NaAcetate (6 μl NaAcetate/300 μl EtOH), and pelleted in a centrifuge. The pellet was washed with 70% EtOH and dried. The pellet was then dissolved in 20 μl H₂O and 4 μl was used for ligation into 2 μg pSK (Stratagene) as a BamHI/SstI fragment. Finally, the signal sequence was subcloned into pUC18 as a BamHI/SstI fragment.

(D) Insertion of GAGP-EGFP or SP-EGFP Construct into pUC18

This step was carried out to insert the GAGP-EGFP or SP-EGFP construct “behind” the signal sequence. The GAGP-EGFP or SP-EGFP construct from (B) above was removed from pKS as an XmaI/SstI fragment. pUC18 containing the signal sequence (SS-pUC18) was restricted with XmaI/SstI. The GAGP-EGFP or SP-EGFP fragment was then annealed with cut SS-pUC18, and ligated. The SS-GAGP-EGFP or SS-SP-EGFP gene sequence was then confirmed through DNA sequencing using the pUC18 17-residue sequencing primer (Stratagene).

(E) Insertion of SS-GAGP-EGFP or SS-SP-EGFP Genes into pBI121

The SS-GAGP-EGFP or SS-SP-EGFP gene from (D) above was removed from pUC18 as BamHI/SstI fragments. pBI121 (Clontech) was restricted with BamHI and SstI and the larger plasmid fragments recovered. The smaller fragments, containing the GUS reporter gene, were discarded. The SS-GAGP-EGFP or SS-SP-EGFP fragment was annealed with the restricted pBI121 fragment and ligated.

Example 9 Agrobacterium Transformation with pBI121-Derived Plasmids

2 μg of pBI121 containing SS-GAGP-EGFP or SS-SP-EGFP from Example 8 above was used to transform Agrobacterium tumefaciens (Strain LBA4404, from Dr. Ron Sederoff, North Carolina State University) according to An et al., Plant Molecular Biology Manual A3:1-19 (1988).

Example 10 Transformation of Tobacco Cultured Cells with pBI121-Derived Plasmids

All steps were carried out under sterile conditions. Tobacco cells were grown for 5-7 days in NT-1 medium (pH 5.2, per liter: 1 L packet of MS Salts (Sigma #S5524), 30 g sucrose, 3 ml 6% KH₂PO₄, 100 mg Myo-Inositol, 1 mL Thiamine.HCl (1 mg/ml stock), 20 μl 2,4-D (10 mg/ml stock)) containing 100 μg/ml kanamycin. The cells were grown in 1 L flasks containing 500 mL medium on a rotary shaker (94 rpm, 27° C.) to between 15-40% packed cell volume. Agrobacterium cells transformed with pBI121-derived plasmid (Example 9) were grown overnight in Luria Broth containing 30 μg/ml kanamycin. The Agrobacterium cell broth was pelleted for 1 minute at 6000 rpm, and the pellet resuspended in 200 μl NT-1 medium.
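The stock additions in the NT-1 recipe can be converted to final concentrations per liter with a few lines of Python. The sketch below is illustrative only; it assumes that "6% KH₂PO₄" means 6% w/v (60 mg/ml) and that the added stock volumes are negligible relative to the 1 L final volume.

```python
# Sketch: final concentrations (mg per liter) of NT-1 supplements added from stocks.
# Each entry: (volume added per liter in ml, stock concentration in mg/ml).
STOCK_ADDITIONS = {
    "KH2PO4": (3.0, 60.0),        # 6% w/v assumed, i.e. 60 mg/ml
    "Thiamine.HCl": (1.0, 1.0),
    "2,4-D": (0.020, 10.0),       # 20 ul of a 10 mg/ml stock
}

def final_mg_per_liter(volume_ml: float, stock_mg_per_ml: float) -> float:
    """Mass of solute delivered into one liter of medium."""
    return volume_ml * stock_mg_per_ml

final_conc = {
    name: final_mg_per_liter(vol, conc)
    for name, (vol, conc) in STOCK_ADDITIONS.items()
}

for name, mg_l in final_conc.items():
    print(f"{name}: {mg_l:g} mg/L")
```

For the 500 mL cultures described, each amount is simply halved.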

Excess medium was removed from the tobacco cell broth until the broth had a consistency approximating that of applesauce. The tobacco cells were placed in a petri dish, and 200 μl of the Agrobacterium preparation was added. The mixture was then incubated at room temperature, in the dark, for 48 hours.

The mixture was then washed 4 times with 20 ml NT-1 to remove the Agrobacterium cells, and the plant cells were plate-washed on NT-1 plates containing 400 μg/ml timentin and 100 μg/ml kanamycin. Cells which grew on the antibiotics were selected and checked for green fluorescence through fluorescence microscopy, excitation wavelength 488 nm (see Example 16).

Example 11 Transformation of Tomato Cultured Cells with pBI121-Derived Plasmids

All steps were carried out under sterile conditions. Tomato cells were grown for 5-7 days in Schenk-Hildebrandt medium (pH 5.8, per liter: 1 L packet of S-H basal salt (Sigma #S6765), 34 g sucrose, 1 g Schenk-Hildebrandt vitamin powder (Sigma #S3766), 100 μl Kinetin 1 mg/ml stock (Sigma #K32532), 44 μl 2,4-D 10 mg/ml stock, 2.1 ml p-chlorophenoxy acetic acid 1 mg/ml stock (Sigma)) containing 200 μg/ml kanamycin. The cells were grown in 1 L flasks containing 500 mL medium on a rotary shaker (94 rpm, 27° C.) to between 15-40% packed cell volume. Agrobacterium cells transformed with pBI121-derived plasmid (Example 9) were grown overnight in Luria Broth containing 30 μg/ml kanamycin. The Agrobacterium cell broth was pelleted for 1 minute at 6000 rpm, and the pellet resuspended in 200 μl NT-1 medium.

Excess medium was removed from the tomato cell broth until the broth had a consistency approximating that of applesauce. The tomato cells were placed in a petri dish, and 200 μl of the Agrobacterium preparation was added. The mixture was then incubated at room temperature, in the dark, for 48 hours.

The mixture was then washed 4 times with 20 ml NT-1 to remove the Agrobacterium cells, and then the plant cells were plate-washed on NT-1 plates containing 400 μg/ml timentin and 200 μg/ml kanamycin. Cells which grew on the antibiotics were selected and checked for green fluorescence through fluorescence microscopy, excitation wavelength 488 nm.

Example 12 Isolation of GAGP-EGFP from Tobacco Cell Suspension Culture Medium

Transformed tobacco cells were grown on a rotary shaker as described in Example 10 above. The medium was separated from the cells by filtration on a glass sintered funnel (coarse grade), and the medium concentrated by freeze-drying. The medium was then resuspended in water (˜50 ml/500 mL original volume before lyophilization), and dialyzed against cold water for 48 hours (water changed 6 times). The precipitated pectin contaminants were removed by centrifugation, the pellet discarded, and the supernatant freeze-dried. The dried supernatant was then dissolved in Superose Buffer at 20 mg/ml (200 mM sodium phosphate buffer, pH 7, containing 0.05% sodium azide), and spun in a centrifuge to pellet insolubles. 1.5 ml of this preparation (18-30 mg) was then injected into a semi-preparative Superose-12 gel filtration column (Pharmacia), equilibrated in Superose Buffer and eluted at 1 ml/minute. The UV absorbance was monitored at 220 nm. 2 ml fractions were collected throughout, with GAGP-EGFP expected to elute between 59 and 70 minutes (˜2.5 Vo). GAGP-EGFP actually eluted at 65 minutes (see FIG. 3, Example 15 for method used to analyze peaks).
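The elution time can be related to elution volume and fraction number with the stated flow rate and fraction size. The Python sketch below is a back-of-envelope aid only: it ignores any delay volume between injector, detector and fraction collector, numbers fractions from 1 starting at injection, and reads "˜2.5 Vo" as elution at about 2.5 void volumes, all of which are assumptions made here.

```python
# Sketch: map elution time to volume and fraction number on the gel filtration run.
FLOW_ML_PER_MIN = 1.0
FRACTION_ML = 2.0

def elution_volume_ml(minutes: float) -> float:
    """Volume of buffer delivered by the given time after injection."""
    return minutes * FLOW_ML_PER_MIN

def fraction_number(minutes: float) -> int:
    """1-based index of the fraction being collected at the given time."""
    return int(elution_volume_ml(minutes) // FRACTION_ML) + 1

observed_min = 65
observed_fraction = fraction_number(observed_min)
void_volume_estimate_ml = elution_volume_ml(observed_min) / 2.5

print(f"peak at {elution_volume_ml(observed_min):.0f} ml, fraction {observed_fraction}")
print(f"expected window: fractions {fraction_number(59)} to {fraction_number(70)}")
print(f"implied void volume: about {void_volume_estimate_ml:.0f} ml")
```

In practice the fractions flanking the calculated one would also be checked, since the peak spans more than one 2 ml fraction.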

The Superose peak containing GAGP-EGFP was dialyzed against cold water for 24 hours (4 water changes), and freeze-dried. The dried GAGP-EGFP peak was then dissolved in 250 μl 0.1% aqueous TFA (Pierce) and loaded onto a PRP-1 column (Polymeric Reverse Phase, Hamilton) equilibrated in Buffer A (0.1% aqueous TFA). The column was then eluted with Buffer B (0.1% TFA/80% acetonitrile in water; gradient=0-70% B/100 min) at a rate of 0.5 mL/minute. UV absorbance was monitored at 220 nm, and GAGP-EGFP eluted at 63 minutes (see FIG. 4, Example 15 for method used to analyze peaks). Finally, the TFA/acetonitrile was removed through N₂ (g) blowdown.
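The approximate solvent composition at which the fusion protein eluted can be estimated from the gradient description. The Python sketch below assumes a linear gradient that starts at injection and ignores the gradient delay (dwell) volume of the system, so the result is only a nominal figure; the function names are introduced here for illustration.

```python
# Sketch: nominal mobile-phase composition during the 0-70% B / 100 min gradient.
GRADIENT_END_PERCENT_B = 70.0
GRADIENT_MINUTES = 100.0
ACETONITRILE_FRACTION_IN_B = 0.80  # Buffer B is 80% acetonitrile

def percent_b(minutes: float) -> float:
    """Percent Buffer B at a given time, clamped to the gradient limits."""
    minutes = max(0.0, min(minutes, GRADIENT_MINUTES))
    return GRADIENT_END_PERCENT_B * minutes / GRADIENT_MINUTES

def percent_acetonitrile(minutes: float) -> float:
    """Nominal percent acetonitrile in the mobile phase at a given time."""
    return ACETONITRILE_FRACTION_IN_B * percent_b(minutes)

elution_min = 63
print(f"{percent_b(elution_min):.1f}% B, "
      f"about {percent_acetonitrile(elution_min):.1f}% acetonitrile")
```

The true composition at the column outlet will lag this value by an amount that depends on the instrument plumbing.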

Example 13 Characterization of GAGP-EGFP by Neutral Sugar Analysis

100 μg of GAGP-EGFP isolated from tobacco cells was aliquoted into a 1 ml glass microvial and dried under N₂ (g). 200 μl 2N TFA was added and the vial capped. The vial was heated at 121° C. for 1 hour, then blown down under N₂ at 50° C. to rid the sample of acid. 25 μl of sodium borohydride solution (20 mg/ml in 3 M ammonium hydroxide) was added and the mixture kept at room temperature for 1 hour. 1-3 drops of concentrated acetic acid were added until fizzing stopped, and the mixture was blown down under N₂ at 40° C. 100 μl MeOH was added, the mixture vortexed and blown down under N₂ at 40° C., then this step was repeated. A mixture of 100 μl MeOH and 100 μl H₂O was added, vortexed, and blown down under N₂ at 40° C., then the procedure of adding 100 μl MeOH, vortexing, and N₂ treatment was repeated 3 times. The resultant mixture was then dried under vacuum overnight.

50 μl reagent grade acetic anhydride was added and the mixture heated at 121° C. for 0.5 hour. The sample was then analyzed by gas chromatography as described in Kieliszewski et al., Plant Physiol. 98:919 (1992). The sample was shown to contain hydroxyproline and sugar, accounting for ˜50% of the fusion product on a dry weight basis. Galactose, arabinose, and rhamnose occur in a 3:3:1 molar ratio, similar to native GAGP's 3.5:4:1 molar ratio. This is consistent with the likely presence of both Hyp-arabinosides and Hyp-arabinogalactan polysaccharide in the expressed construct. The lower ratio of Ara in the GAGP-EGFP fusion glycoprotein is consistent with the Ala for Pro substitution (see Example 6), which removes one arabinosylation site in the peptide.

Example 14 Characterization of GAGP-EGFP by Hydroxyproline Assay

100 μg purified GAGP-EGFP was hydrolyzed with 6N HCl (Pierce) at 110° C. for 18 hours. The excess acid was then removed by blowing down under N₂. Hydroxyproline was then determined following Kivirikko and Liesmaa, Scand. J. Clin. Lab. Invest. 11:128 (1959).

Example 15 Characterization of Tobacco and Tomato Expression Products by Enzyme-Linked Immunosorbent Assay

GAGP-EGFP and SP-EGFP products from tomato and tobacco cell medium and column peaks (see Example 12) were detected by Enzyme-Linked Immunosorbent Assays (ELISA) using the method of Kieliszewski and Lamport, “Cross-reactivities of polyclonal antibodies against extensin precursors determined via ELISA techniques,” Phytochemistry 25:673-677 (1986). The GAGP-EGFP product was also assayed using anti-EGFP antibodies. Anti-EGFP antibodies (Clontech) were the primary antibody, diluted 1000-fold as recommended by the manufacturer. The secondary antibody was peroxidase-conjugated goat anti-rabbit IgG diluted 5000-fold (Sigma). Recombinant EGFP (Clontech) was used as a control. This assay was used to generate FIGS. 3 and 4 from Example 12 above.

Example 16 Characterization of Tobacco and Tomato Expression Products by Fluorescence

Culture medium from both tobacco and tomato cells transformed with the GAGP-EGFP and the SP-EGFP genes was collected. The EGFP tag fluoresces when exposed to UV light; the excitation wavelength used here was 488 nm. These media were compared with media which included EGFP expressed behind the signal sequence and secreted into the medium, medium from cells transformed with unaltered pBI121, and medium from untransformed cells. The unmistakable bright green fluorescence (data not shown) allowed visualization of the targeted products during their transit through the ER/Golgi membrane system. As Agrobacterium lacks the posttranslational machinery to make HRGPs, the fluorescing proteins must be of plant origin.

Example 17 Tobacco Leaf Disc Transformation

Sterile tobacco leaves were cut into small pieces and wounded with a needle. 4 ml NT-1 medium without hormones (NT-1 medium of Example 10, omitting 2,4-D) and 150 μl concentrated overnight culture of Agrobacterium (see Example 9) were added to the leaves, and the leaf discs were incubated for 48 hours in the dark. The leaf discs were then washed with NT-1 medium without hormones. The discs were then put on NT-1 solid medium plates (NT-1 medium of Example 10 plus 7.5 g Bactoagar (Difco Laboratories)) containing 400 μg/ml timentin and 100 μg/ml kanamycin.

After 3 weeks, shoots were transferred from NT-B solid medium without hormones [NT-1 Medium of Example 10, omitting 2,4-D, and adding 300 μl/L benzyl adenine, made from a 2 mg/ml stock made up in DMSO (N-benzyl-9-(tetrahydropyranyl)adenine (Sigma))] to root. Transformed plants have expressed SP-EGFP and GAGP-EGFP in leaf and root cells, as determined by the fluorescence assay of Example 16 (data not shown).

Example 18 Sequence Analysis of GAGP and Determination of a Consensus Sequence

This Example describes amino acid sequencing, glycosyl and linkage analysis of GAGP which yielded sequences (including preferred consensus sequences) within the scope of SEQ ID NO:136.

1. Experimental

The following experimental protocols were used to arrive at preferred embodiments of the invention's sequences.

A. Size Fractionation

GAGP was isolated via preparative Superose-6 gel filtration using the method of Qi et al. [Qi et al. (1991) supra] as follows. Nodules of gum arabic (Kordofan Province, Sudan) were a gift from Gary Wine of AEP Colloids (Ballston Spa, N.Y.). Nodules were ground to a fine flour (ca. 2 min.) in a Tekmar A-10 mill. Samples of gum arabic (100 mg/ml) were dissolved in water then diluted to 50 mg/ml in 0.2 M sodium phosphate buffer (pH 7). Samples were spun to pellet insoluble material and 1 ml aliquots were injected onto a semi-preparative Superose-6 gel filtration column (1.6 cm i.d.×50 cm, Pharmacia), eluted isocratically as described previously [Qi et al. (1991) supra]. The protein peaks corresponding to GAGP were dialyzed against water to remove salt and then freeze-dried.

B. HF-Deglycosylation

For chymotryptic peptide mapping, GAGP was HF-deglycosylated as follows. The Superose-6 fractionated GAGP was deglycosylated in anhydrous hydrogen fluoride (HF) (20 mg powder/ml HF for 1 h at 4°) as described earlier [Qi et al. (1991) supra], repeating the procedure twice to ensure complete deglycosylation; the product was designated dGAGP.

C. Purification of Size-Fractionated GAGP and dGAGP by Reverse Phase HPLC

Superose-fractionated GAGP (for glycoside analyses) or dGAGP (for peptide mapping) was purified on a Hamilton PRP-1 semi-preparative column (10 mm, 250×4.1 mm) by equilibrating with Buffer A (0.1% TFA, aqueous) and eluting with Buffer B (0.1% TFA, 80% acetonitrile, aqueous) by gradient elution (0-100% B/80 min.; 0.5 mL/min flow rate). The eluate was monitored at 220 nm. The collected peaks were blown down to dryness with N₂(g), redissolved in ddH₂O, then freeze-dried.

D. Proteolysis of Deglycosylated GAGP with Chymotrypsin or Pronase

2-9 mg samples of dGAGP were digested with pronase or chymotrypsin as detailed earlier [Kieliszewski et al. (1992) Plant Physiology 99:538]. The digests were then freeze-dried.

E. Fractionation of dGAGP Chymotryptic Peptides by Cation Exchange HPLC

dGAGP chymotryptic peptides (400 μg/injection) were fractionated on a PolySULFOETHYL A™ cation exchange column (9.4 mm i.d.×200 mm; PolyLC, Ellicott City, Md.) equilibrated with Buffer A (5 mM potassium phosphate/phosphoric acid buffer, pH 3, containing 25% v/v acetonitrile) and eluted with Buffer B (Buffer A containing 1 M KCl) using programmed gradient elution. The elution gradient was 0-4% Buffer B in 45 min., 4-8% Buffer B from 45 to 50 min, and 8-30% Buffer B from 50-65 min. The flow rate was 0.4 mL/min and the absorbance was monitored at 220 nm. The collected peaks were pooled, blown down with N₂ (g), redissolved in ddH₂O, then freeze-dried.
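The programmed cation exchange gradient above is piecewise linear, and the nominal KCl concentration at any time follows from the fact that Buffer B is Buffer A containing 1 M KCl. The sketch below interpolates between the breakpoints given in the text; the function names are hypothetical and system delay volume is ignored.

```python
# (time in minutes, % Buffer B) breakpoints taken from the text
GRADIENT = [(0.0, 0.0), (45.0, 4.0), (50.0, 8.0), (65.0, 30.0)]
KCL_IN_BUFFER_B_MOLAR = 1.0  # Buffer B = Buffer A + 1 M KCl


def percent_b(t_min):
    """Nominal % Buffer B at time t_min by linear interpolation."""
    if t_min <= GRADIENT[0][0]:
        return GRADIENT[0][1]
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    return GRADIENT[-1][1]  # hold the final composition after 65 min


def kcl_millimolar(t_min):
    """Nominal KCl concentration (mM) in the mobile phase at time t_min."""
    return percent_b(t_min) / 100.0 * KCL_IN_BUFFER_B_MOLAR * 1000.0
```

For example, `kcl_millimolar(47.5)` gives the nominal salt concentration midway through the second gradient segment.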

F. Peptide Isolation Via Reverse Phase HPLC

The partial pronase digest of dGAGP and major peaks S1 and S2 from the PolySULFOETHYL Aspartamide column were dissolved in Buffer A (0.1% TFA, aqueous) and injected onto a Hamilton PRP-1 analytical reverse phase column (4.1 mm i.d.×150 mm) which was eluted at 0.5 mL/min with a Buffer B (0.1% TFA and 80% v/v acetonitrile) gradient of 0-50% in 100 min. The effluent was monitored at 220 nm and collected peaks were blown down with N₂(g), redissolved in ddH₂O, and then freeze-dried prior to sequencing. For increased resolution of pronase peptide P3 (FIG. 6), P3 was run through the PRP-1 column a second time, eluting with a 0-30% Buffer B gradient.

G. Automated Edman Degradation of dGAGP Chymotryptic Peptides

dGAGP peptides were sequenced at the Michigan State University Macromolecular Facility on a 477A Applied Biosystems (Foster City, Calif.) gas phase sequencer.

H. Amino Acid Analysis

Amino acid compositions were determined by precolumn derivatization of amino acids with 6-aminoquinolyl-N-hydroxysuccinimidyl carbamate followed by reverse-phase HPLC (Nova-Pak™ C₁₈ column) using the Waters AccQ-Tag Chemistry Package and the gradient recommended by Waters for analyzing collagen hydrolysates [Crimmins and Cheman (1997) Analytical Biochemistry 244:407; van Wandelen and Cohen (1997) Journal of Chromatography A 763:11].

Hydroxyproline Glycoside Profile. The distribution of GAGP hydroxyproline glycosides was determined after alkaline hydrolysis (105°, 18 h, 0.22 N Ba(OH)₂) and neutralization followed by chromatography on a 75×0.6 cm Technicon Chromobeads C2 cation exchange resin as described earlier [Lamport and Miller (1971) Plant Physiology 48:454].

I. Isolation of the Hyp-Polysaccharide

Alkaline hydrolysates (see above) of Superose-6 and PRP-1 purified GAGP were loaded onto a G-50 Sephadex gel permeation column eluted isocratically with 100 mM ammonium acetate buffer, pH 6.8, at a flow rate of 0.3 ml/min. One ml fractions were collected and 40 μl aliquots of each fraction were assayed for Hyp as described earlier [Kivirikko and Liesmaa (1959) Scandinavian Journal of Clinical Laboratories 11:128; Kieliszewski et al. (1990) Plant Physiology 92:316]. The fractions were freeze-dried, then weighed, and the amounts of Hyp and sugar in the fractions were calculated from the recovered weights, Hyp assays, and monosaccharide composition analyses.

J. Partial Alkaline Hydrolysis of GAGP

Superose-fractionated GAGP (10 mg/ml) was dissolved in 0.2 N NaOH/NaBH₄ and heated at 50° C. as described earlier [Akiyama and Kato (1984) Agricultural and Biological Chemistry 48:235]. A 200 μl aliquot was removed immediately (time zero control) and hourly for 6 h, cooled in ice, then 20 μl glacial acetic acid was added (final pH=5.8). Each sample was assayed for Hyp as described earlier [Kivirikko and Liesmaa (1959) Scandinavian Journal of Clinical Laboratories 11:128; Kieliszewski et al. (1990) Plant Physiology 92:316].

K. Saccharide Composition and Linkage Analysis

Monosaccharide compositions and linkage analyses were determined at the Complex Carbohydrate Research Center, University of Georgia following the methods of York et al. [York et al. (1985) Methods in Enzymology 118:3] and Merkle and Poppe [Merkle and Poppe (1994) Methods in Enzymology 230:1].

2. Determination of an Exemplary Consensus Sequence

Using the method of Qi et al. [Qi et al. (1991) supra] the inventors isolated GAGP via preparative Superose-6 gel filtration. For chymotryptic peptide mapping, HF-deglycosylated GAGP was used. This gave a major symmetrical peak (designated dGAGP) when further fractionated by reverse phase chromatography as shown in FIG. 5. FIG. 5 is the elution profile for dGAGP by reverse phase chromatography on a Hamilton PRP-1 column and fractionation by gradient elution. The component at 35 min. was a Hyp-poor contaminant.

Amino acid analysis showed dGAGP had a highly biased but constant amino acid composition in fractions sampled across the peak (Table 5), indicating that dGAGP was a single polypeptide component sufficiently pure for sequence analysis.

TABLE 5

                      dGAGP Peak Fractions*            GAGP
Amino                                                  [Qi et al.
Acid⁺    GAGP     Ascending   Center   Descending      (1991) supra]
Hyp      40.0     38.4        36.7     36.3            36.9
Asx      0.0      0.0         0.0      0.0             1.6
Ser      22.2     21.6        21.6     22.5            19.4
Glx      0.0      0.0         0.0      0.0             1.9
Gly      4.5      4.8         4.4      4.3             6.4
His      6.6      8.7         8.2      8.4             7.1
Arg      0.0      0.0         0.0      0.0             0.0
Thr      10.2     10.6        12.2     11.4            8.8
Ala      1.2      0.7         0.8      1.0             1.3
Pro      8.0      7.6         8.3      8.1             6.8
Tyr      0.0      0.0         0.0      0.0             0.3
Val      0.0      0.0         0.0      0.0             0.8
Met      n.d.⁺⁺   n.d.⁺⁺      n.d.⁺⁺   n.d.⁺⁺          n.d.⁺⁺
Lys      0.0      0.0         0.0      0.0             1.0
Ile      0.2      0.0         0.0      0.0             0.4
Leu      6.4      7.6         7.8      8.1             6.4
Phe      0.5      0.0         0.0      0.0             0.9
Trp      n.d.⁺⁺   n.d.⁺⁺      n.d.⁺⁺   n.d.⁺⁺          n.d.⁺⁺
Cys      0.0      0.0         0.0      0.0             0.0

Amino acid compositions of glycosylated GAGP (GAGP) and deglycosylated GAGP (dGAGP) fractions obtained by reverse phase HPLC compared to dGAGP isolated by Qi et al. [Qi et al. (1991) Plant Physiology 96:848].
*To check peak homogeneity, three consecutive fractions across the dGAGP peak were analyzed (designated Ascending, Center, and Descending).
⁺Represented as mole percent.
⁺⁺Not determined.

This was confirmed by the isolation of peptides (Table 6) similar in composition to one another and to the parent GAGP (Table 5).

TABLE 6
Pronase and chymotryptic peptide sequences from the dGAGP Polypeptide Backbone

Pronase Peptide                       Sequence*
P1 (SEQ ID NOs: 184, 185)             SOOOTLSOSOTOTOOOGPHSOOO(O)-
P3 (SEQ ID NOs: 186, 187)             SOOO(T/S)LSOSOTOTXOO-
PH3G2⁺ (SEQ ID NO: 188)               SOSOTOTOOOGP

Chymotryptic Peptide                  Sequence*
S1P2 (SEQ ID NO: 189)                 SOOOSLSOSOTOTOOTGPH
S1P3 (SEQ ID NO: 190)                 SOOOOLSOSOTOTOOOGP-
S1P4 (SEQ ID NOs: 191, 192)           SOLPTLSOLP(A/T)OTOOOGPH
S1P5 (SEQ ID NO: 193)                 SOOOOLSOSLTOTOOLGP-
S2P1 (SEQ ID NO: 194)                 SOSOTOTOOOGPH
S2P2a (SEQ ID NO: 195)                SOSOAOTOOLGPH
S2P2b (SEQ ID NO: 196)                SOLPTOTOOLGPHS
S2P3 (SEQ ID NO: 197)                 SOSOTOTOOLGPH
S2P4 (SEQ ID NO: 198)                 SOOLTOTOOLLPH

Consensus⁺⁺ (SEQ ID NO: 179)          SOOO(O/T/S)LSOSOTOTOO(O/L)GPH

*O denotes hydroxyproline in the peptide sequences; X denotes a blank cycle.
⁺From Delonnay et al. (1993).
⁺⁺Derived from the major peptides P1, P3, S1P3, S1P5, S2P1, S2P3 and PH3G2.

Although native GAGP resists pronase digestion [Akiyama and Kato (1984) Agricultural and Biological Chemistry 48:235; Chikamai et al. (1996) Food Hydrocolloids 10:309], which only generates large fragments of ˜200 kDa [Connolly et al. (1988) Carbohydrate Polymers 8:23], preliminary work in Lamport's laboratory showed that exhaustive digestion with pronase effectively cleaved dGAGP to small peptides [Delonnay (1993) Masters Thesis, Michigan State University, MI]. However, the peptides lacked some of the amino acids present in Qi et al.'s empirical formula: Hyp₄ Ser₂ Thr Pro Gly Leu His (SEQ ID NO:199) of the repeat motif suggested by Qi et al. [Qi et al. (1991) supra], most notably His (Table 6, peptide PH3G2). Therefore, a partial pronase digestion of dGAGP was performed. This gave two large major peptides P1 and P3, as shown in FIG. 6, with partial sequences (Table 6) containing all of the amino acids in the empirical formula.

dGAGP was also digested with chymotrypsin, which slowly cleaved leucyl and histidyl bonds, followed by a two-stage HPLC fractionation scheme. Initial separation of the chymotryptides on a PolySULFOETHYL A™ (PolyLC, Inc. Ellicott City, Md.) cation exchanger yielded two major fractions designated S1 and S2 (FIG. 7). The major chymotryptic fractions, S1 and S2, were collected for further fractionation by reverse phase column chromatography. Further chromatography on a Hamilton PRP-1 reverse phase column resolved fraction S1 into five major peptides labeled S1P1-S1P5, while fractionation of S2 resolved four major peptides, designated S2P1-S2P4, which were sequenced (FIG. 8 a & b). Edman degradation showed that these chymotryptides were closely related to each other and to the pronase peptides (Table 6). These peptides reflect the overall amino acid composition of GAGP and can be related to the 19-amino acid residue consensus sequence (SEQ ID NO:179) shown in Table 6.

From the above data, the inventors concluded that GAGP possesses a highly repetitive polypeptide, albeit with minor variations in the sequence. Based on a linear GAGP molecule of 150 nm [Qi et al. (1991) supra], and presuming the extended polyproline II helix present in both extensins and AGPs [Kieliszewski and Lamport (1994) Plant Journal 5:157; Nothnagel (1997) International Review of Cytology 174:195], the inventors estimate that GAGP contains about 20 peptide repeats with occasional partial repeats. Partial repeats of the consensus sequence may account for the somewhat higher serine content in native GAGP compared to that in the consensus sequence.

The exemplary 19-amino acid residue GAGP consensus sequence of Table 6 contains approximately nine Hyp residues and is roughly twice the size of that previously postulated to contain only a single polysaccharide attachment site [Qi et al. (1991) supra]. Judging from the Hyp-glycoside profile of GAGP (Table 7) [Qi et al. (1991) supra], about one in every five Hyp residues is polysaccharide-substituted.

TABLE 7
GAGP Hydroxyproline glycoside profile

Hydroxyproline glycoside            Percent of total hydroxyproline
Hyp-polysaccharide                  20
Hyp-Ara₄ (SEQ ID NO: 200)           5
Hyp-Ara₃ (SEQ ID NO: 201)           27
Hyp-Ara₂ (SEQ ID NO: 202)           27
Hyp-Ara (SEQ ID NO: 203)            10
Nonglycosylated Hyp                 11

Thus, there are approximately two Hyp-polysaccharide sites in the invention's exemplary consensus sequence. In order to determine which Hyp residues are involved in polysaccharide attachment, without limiting the invention to any particular mechanism, the inventors predict arabinosylation of contiguous Hyp residues and arabinogalactan-polysaccharide addition to clustered non-contiguous Hyp residues, such as the X-Hyp-X-Hyp modules common in AGPs [Nothnagel (1997) International Review of Cytology 174:195]. Based on this prediction, it is the inventors' view that the exemplary consensus sequence of Table 6 contains approximately two polysaccharide attachment sites in the clustered non-contiguous Hyp motif Ser-Hyp-Ser-Hyp-Thr-Hyp, which is flanked by arabinosylated contiguous Hyp residues as depicted in FIG. 9. FIG. 9 uses the standard single letter code for amino acids except for Hyp, which is denoted by O [Du et al. (1994) Plant Cell 6:1643], and the standard three letter code for sugars, except for glucuronic acid, which is denoted as GlcA. This model depicts a symmetrical distribution of arabinosides and polysaccharide substituents which is directed by the palindrome-like arrangement of the Hyp residues in the peptide backbone; Ser-0 is the palindromic center. However, degenerate variations occur (Table 6). The inventors base this structure on compositional and linkage analyses of the isolated Hyp-polysaccharide fraction (Tables 7 & 8) [Qi et al. (1991) supra] and on the pentasaccharide side-chain structure elucidated for crude gum arabic by Defaye and Wong [Defaye and Wong (1986) Carbohydrate Research 150:221] (corresponding to Rha_(t), Ara_(t), 3-Ara, 4-GlcA, and 2,3,6-Gal in Table 9).
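The estimate of approximately two Hyp-polysaccharide sites per consensus repeat follows from the Hyp count of the consensus sequence (Table 6) and the Hyp-polysaccharide share in Table 7. The sketch below reproduces that arithmetic; the parsing helpers are hypothetical and written only for this illustration.

```python
import re

# Consensus from Table 6; O = hydroxyproline, (a/b) = alternatives at one position
CONSENSUS = "SOOO(O/T/S)LSOSOTOTOO(O/L)GPH"

# Table 7: percent of total hydroxyproline in each glycoside class
HYP_PROFILE = {
    "Hyp-polysaccharide": 20,
    "Hyp-Ara4": 5,
    "Hyp-Ara3": 27,
    "Hyp-Ara2": 27,
    "Hyp-Ara": 10,
    "Nonglycosylated Hyp": 11,
}


def consensus_length(consensus):
    """Number of residue positions, counting each (a/b) group once."""
    return len(re.sub(r"\([^)]*\)", "X", consensus))


def hyp_count_range(consensus):
    """(minimum, maximum) Hyp residues, depending on the variable positions."""
    fixed = re.sub(r"\([^)]*\)", "", consensus).count("O")
    variable = sum("O" in g.split("/") for g in re.findall(r"\(([^)]*)\)", consensus))
    return fixed, fixed + variable


low, high = hyp_count_range(CONSENSUS)        # brackets "approximately nine"
polysaccharide_fraction = HYP_PROFILE["Hyp-polysaccharide"] / 100
sites_per_repeat = 9 * polysaccharide_fraction  # close to two sites
```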

Hydroxyproline-O-glycosidic linkages are stable in base [Lamport (1967) Nature 216:1322; Miller et al. (1972) Science 176:918; Pope (1977) Plant Physiology 59:894], in contrast to other O-glycosylated hydroxyamino acids such as serine and threonine, which undergo β-elimination [Lamport et al. (1973) Biochemical Journal 133:125]. Therefore, alkaline hydrolysis was used to isolate and characterize Hyp-arabinogalactan polysaccharides from GAGP as demonstrated earlier [Qi et al. (1991) supra].

Compositional analysis of the small Hyp-polysaccharides isolated from GAGP after fractionation of the alkaline hydrolysate on Sephadex G-50 (FIG. 10; Table 8) indicated a content of 5158 nmol sugar.

TABLE 8
Glycosyl compositions of intact GAGP and a GAGP Hyp-polysaccharide isolated from GAGP base hydrolysates

                    GAGP [Qi et al.
Glycosyl Residue    (1991) supra]      GAGP Hyp-polysaccharide
                    (Mol %)
Ara                 36                 38
Gal                 46                 34
Rha                 10                 13
GlcUA               9                  15

In FIG. 10, assay of Hyp across the recovered fractions indicated a broad size range for the Hyp-polysaccharide (fractions 17-32). Fractions 27-30 were collected for linkage and composition analyses. Hyp arabinosides and non-glycosylated Hyp eluted in fractions 33-42. Corresponding quantitative Hyp assays showed a total of 220 nmol Hyp in the peak isolated and analyzed (FIG. 10). The molar ratio of 220 nmol Hyp:5156 nmol sugar indicated a ˜23-residue rhamnoglucuronoarabinogalactan Hyp-polysaccharide substituent in this fraction. Methylation analysis of the polysaccharide (Table 9) showed linkages consistent with the model featured in FIG. 9, but containing 21-22 sugar residues rather than the 23 featured in FIG. 9.
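The ˜23-residue estimate is simply the molar ratio of sugar to Hyp in the analyzed peak, assuming one polysaccharide per Hyp. A minimal sketch of that arithmetic follows; the text quotes the sugar content as both 5158 and 5156, and either value gives about 23. The function name is hypothetical.

```python
HYP_AMOUNT = 220.0                 # Hyp in the analyzed peak (from the text)
SUGAR_AMOUNTS = (5156.0, 5158.0)   # both values quoted in the text, same units as Hyp


def sugars_per_hyp(sugar, hyp=HYP_AMOUNT):
    """Average number of sugar residues attached per Hyp residue."""
    return sugar / hyp


estimates = [sugars_per_hyp(s) for s in SUGAR_AMOUNTS]
rounded = [round(e) for e in estimates]
```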

TABLE 9
Glycosyl linkages of Intact GAGP and a GAGP Hyp-polysaccharide isolated from the GAGP base hydrolysate

Glycosyl Linkage               GAGP     GAGP Hyp-Polysaccharide
                               (Mol %)
t-Rha                          6.7      10.4 (2)*
2,3,4-Rha                      3.3      0.0
t-Ara (f)                      13.3     16.2 (4)
t-Ara (p)                      1.7      2.3 (0-1)
2-Ara (f)                      2.5      0.0
3-Ara (f)                      8.3      11.0 (2-3)
4-Ara (p) or 5-Ara (f)         1.7      0.0
2,4-Ara or 2,5-Ara (f)         0.8      0.0
2,3,4-Ara or 2,3,5-Ara (f)     2.5      0.0
t-Gal                          5.8      11.8 (3)
2-Gal                          0.8      0.0
3-Gal                          2.7      4.5 (1)
4-Gal                          0.8      0.5
6-Gal                          2.5      2.4 (0-1)
3,4-Gal                        2.5      7.7 (2)
3,6-Gal                        11.7     12.7 (3)
3,4,6-Gal                      10.0     9.4 (2)
2,3,6-Gal                      3.3      0.0
2,3,4,6-Gal                    5.8      0.0
t-GlcUA                        1.7      0.9
4-GlcUA                        7.5      10.2 (2)
3,4-GlcUA                      1.7      0.0
2,4-GlcUA                      0.8      0.0
2,3,4-GlcUA                    0.8      0.0
4-Glc                          0.8      0.0

*Estimated number of residues/polysaccharide.

Based on the above data, the inventors conclude that each small polysaccharide contains two pentasaccharide side chains (Gal, Ara₂, GlcA, Rha) arranged along a ˜7-residue (1-3)β-D-galactan backbone helix which also contains monosaccharide side chains of Ara and Gal.

Data presented herein demonstrate that the linkage analyses of both Hyp-polysaccharide and GAGP (Table 9) are similar, thus providing evidence of similarity between GAGP and gum arabic polysaccharides. These results suggest that the larger Hyp-polysaccharides (FIG. 10) may be comprised of repeat units containing approximately 12 galactose residues/repeat. Hence, without limiting the invention to any particular theory or mechanism, the inventors estimate that as many as five side-chains (˜40 sugars) occur in the larger arabinogalactan moieties which eluted in fractions 18-26 from the G-50 Sephadex column (FIG. 10). The inventors further believe that the sensitivity of GAGP and other AGPs to alkaline degradation involves peptide bonds rather than glycosidic linkages.

Example 19 Construction of 8, 16, 20, 32, and 64 Repeats of Gum Arabic Motifs and Expression in Plant Cells

This Example discloses construction of synthetic genes for the expression of gum arabic glycoprotein repeats based on the invention's consensus sequences. The genes had 8, 20, 32, or 64 contiguous units of two motifs [motif 1 (SEQ ID NO:143)=Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His; motif 2 (SEQ ID NO:144)=Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His], each of which is encompassed by the invention's consensus sequence. The 64 contiguous units [i.e., (motif 1-motif 2)₃₂] were constructed using a modification of the previously described [Lewis et al. (1996) Protein Expression & Purification 7:400-406] strategy involving compatible but nonregenerable restriction sites, which allowed construction of very large inserts with precise control over the number of DNA repeats.
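That both motifs are encompassed by the consensus sequence SOOO(O/T/S)LSOSOTOTOO(O/L)GPH of Table 6 can be checked mechanically. The sketch below converts the three-letter motifs to the single-letter code used in Table 6 (O for Hyp) and matches them against a regular expression form of the consensus; the helper names are hypothetical, and the alternating arrangement follows the (motif 1-motif 2)₃₂ notation above.

```python
import re

THREE_TO_ONE = {"Ser": "S", "Hyp": "O", "Leu": "L", "Thr": "T",
                "Gly": "G", "Pro": "P", "His": "H"}

MOTIF_1 = "Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His"
MOTIF_2 = "Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His"

# Consensus SOOO(O/T/S)LSOSOTOTOO(O/L)GPH written as a regular expression
CONSENSUS_RE = re.compile(r"SOOO[OTS]LSOSOTOTOO[OL]GPH")


def one_letter(motif):
    """Convert a dash-separated three-letter sequence to single-letter code."""
    return "".join(THREE_TO_ONE[res] for res in motif.split("-"))


def repeat_polypeptide(n_units):
    """Alternate motif 1 and motif 2 for n_units total units (n_units even)."""
    pair = one_letter(MOTIF_1) + one_letter(MOTIF_2)
    return pair * (n_units // 2)


m1, m2 = one_letter(MOTIF_1), one_letter(MOTIF_2)
both_match = all(CONSENSUS_RE.fullmatch(m) for m in (m1, m2))
```

With 19 residues per unit, the 64-unit construct would encode a repeat region of 64 × 19 residues, not counting the signal sequence or EGFP.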

1. Site-Directed Mutagenesis of pUC18 to Eliminate BsrFI Restriction Site from the Amp^(r) Gene

Plasmid pUC18 has an endogenous BsrFI site in the Amp^(r) gene. This site was eliminated by mutation to make the plasmid amenable to subcloning of the XmaI-BsrFI synthetic gene fragments, using the PCR core system I kit (Promega). The PCR Primer 1 (upstream primer) had the sequence (SEQ ID NO:204) GATACCGCGAGACCCACGCTCACCAGCTCC; this primer was designed from nucleotides 1756 to 1785 of pUC18 except for 1 substitution (A for G) at position 1780 (bolded and underlined). This changes one Ala codon (GCC) for another (GCT), retaining the Amp^(r) amino acid sequence while mutating the BsrFI site. PCR Primer 2 (downstream primer) had the sequence (SEQ ID NO:205) CTCGGTCGCCGCATACACTAT and was designed from nt 2220 to 2198 of pUC18. The PCR reaction conditions were 2 min @ 95° C., 30 sec @ 95° C., 1 min @ 48° C., 1 min @ 72° C. (30 cycles), 5 min @ 72° C. PCR products were separated on a 1.5% agarose gel. The 464 bp PCR fragment was extracted from the gel using the QIAEX II gel extraction kit. The isolated fragment was restricted and subcloned into pUC18 as a ScaI-BpmI fragment. The new plasmid was designated MpUC18 and has an active Amp^(r) gene and no BsrFI site.
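The effect of the single A for G substitution in Primer 1 can be illustrated by scanning the primer for a BsrFI site before and after the change. This sketch assumes the BsrFI recognition sequence is RCCGGY (R = A or G, Y = C or T), which comes from general enzyme references and is not stated in the text; the variable names are hypothetical.

```python
import re

PRIMER_1 = "GATACCGCGAGACCCACGCTCACCAGCTCC"  # SEQ ID NO:204, as given in the text
PRIMER_FIRST_NT = 1756                       # pUC18 coordinate of the first base
MUTATED_NT = 1780                            # pUC18 coordinate of the A-for-G change

# Assumed BsrFI recognition sequence RCCGGY (not stated in the text)
BSRFI_SITE = re.compile("[AG]CCGG[CT]")

index = MUTATED_NT - PRIMER_FIRST_NT         # 0-based position within the primer
base_in_primer = PRIMER_1[index]

# Reverting the substitution reconstructs the unmutated pUC18 sequence
unmutated = PRIMER_1[:index] + "G" + PRIMER_1[index + 1:]

site_in_unmutated = BSRFI_SITE.search(unmutated)
site_in_primer = BSRFI_SITE.search(PRIMER_1)
```

Under that assumption, the unmutated sequence contains the site ACCGGC spanning position 1780, and the primer sequence does not.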

2. Synthesis of Gum Arabic Glycoproteins (GAGP) Repeats Using Mutually Priming Oligonucleotides

DNA encoding gum arabic glycoprotein contiguous units of motif 1 linked to motif 2 was constructed using previously described methods [Current Protocols in Molecular Biology section 8.2.8-8.2.10]. A DNA fragment encoding the two GAGP motifs was synthesized by primer extension of two partially overlapping synthetic oligonucleotides: First oligonucleotide (SEQ ID NO:206): 5′-G GCA AGC TTC CGG AGT GCC GGC CCT CAT AGC CCA CCT CCA CCA TTA TCA CCA TCA CCT ACT CCA ACT CCT CCT TTG GGA CCA CAC AG-3′; second oligonucleotide (SEQ ID NO:207): 5′-GGT CCC GGG GGG TGG TGT TGG GGT TGG TGA AGG GGA AAG TGT AGG GGG TGG ACT GTG TGG TCC CAA AGG AGG-3′. The oligonucleotides (0.05 nmol of each) were heated for 5 min @ 95° C., annealed for 5 min @ 48° C., then extended by DNA polymerase I Klenow fragment (Promega) for 30 min @ 37° C. The reaction was stopped by heating 10 min at 75° C. and the buffer was exchanged via a Sephacryl S-200 column (Pharmacia Microspin™). The extended fragment was then subcloned into MpUC18 as a HindIII-XmaI fragment. The plasmid was sequenced with the pUC/M13 forward primer (17-mer).
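The mutual priming of the two oligonucleotides can be illustrated by finding the region where the 3′ end of the first oligonucleotide is complementary to the 3′ end of the second, and then writing out the top strand of the fully extended duplex. This is a sketch of the sequence logic only (it models no annealing thermodynamics), and the helper names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_priming_overlap(top, bottom):
    """Length of the longest 3' end of `top` that pairs with the 3' end of `bottom`."""
    bottom_rc = reverse_complement(bottom)
    for k in range(min(len(top), len(bottom_rc)), 0, -1):
        if top.endswith(bottom_rc[:k]):
            return k
    return 0


# SEQ ID NO:206 and SEQ ID NO:207 as given in the text (spaces removed)
OLIGO_1 = ("G GCA AGC TTC CGG AGT GCC GGC CCT CAT AGC CCA CCT CCA CCA TTA "
           "TCA CCA TCA CCT ACT CCA ACT CCT CCT TTG GGA CCA CAC AG").replace(" ", "")
OLIGO_2 = ("GGT CCC GGG GGG TGG TGT TGG GGT TGG TGA AGG GGA AAG TGT AGG GGG "
           "TGG ACT GTG TGG TCC CAA AGG AGG").replace(" ", "")

overlap = longest_priming_overlap(OLIGO_1, OLIGO_2)

# Top strand after both 3' ends have been extended to the end of the partner
duplex_top = OLIGO_1 + reverse_complement(OLIGO_2)[overlap:]
```

The extended top strand carries the HindIII site (AAGCTT) near its 5′ end and the XmaI site (CCCGGG) near its 3′ end, consistent with subcloning as a HindIII-XmaI fragment.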

3. Multiplication of GAGP Internal Repeat Using Nonregenerable Restriction Sites

Synthetic genes containing controlled numbers of GAGP repeats were synthesized as follows, and as illustrated in FIG. 15. MpUC18 containing the PCR product described above (two GAGP motifs as shown in FIG. 16A) (designated MpUC gum-2) was divided between two tubes. MpUC gum-2 in tube 1 was restricted with ScaI and BsrFI; MpUC gum-2 in tube 2 was restricted with ScaI and XmaI. The digests were separated on a 1% agarose gel. The 1884 bp band from tube 1 (ScaI/BsrFI digest) and the 1044 bp band from tube 2 (ScaI/XmaI digest) were excised from the gel, combined and ligated together. The resulting plasmid (MpUC gum-4) contained 4 GAGP internal repeats [i.e., (motif 1-motif 2)₂] (FIG. 16B). This strategy was successfully used to build plasmids containing 8, 16, 20, 32, and 64 internal repeats of GAGP.
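Why the XmaI and BsrFI ends are "compatible but nonregenerable" can be sketched at the sequence level. The example below assumes that both enzymes cut after the first base of their recognition sequence, leaving the same 5′-CCGG overhang (XmaI: C^CCGGG; BsrFI: R^CCGGY, an assumption drawn from general enzyme references rather than from the text), and uses the BsrFI site GCCGGC present in the first oligonucleotide of the preceding section. Function names are hypothetical.

```python
import re

XMAI_SITE = "CCCGGG"                          # cut as C^CCGGG
BSRFI_SITE = re.compile("[AG]CCGG[CT]")       # assumed R^CCGGY
BSRFI_IN_GENE = "GCCGGC"                      # BsrFI site in the synthetic gene


def ligate_sites(upstream_site, downstream_site):
    """Hexamer formed when the upstream half of one cut site is joined to the
    downstream half of another; both cuts are after the first base."""
    assert upstream_site[1:5] == downstream_site[1:5] == "CCGG"  # same overhang
    return upstream_site[0] + downstream_site[1:]


def is_recut(site):
    """True if the junction can be cut again by XmaI or BsrFI."""
    return site == XMAI_SITE or BSRFI_SITE.fullmatch(site) is not None


xmai_to_bsrfi = ligate_sites(XMAI_SITE, BSRFI_IN_GENE)   # repeat-repeat junction
bsrfi_to_xmai = ligate_sites(BSRFI_IN_GENE, XMAI_SITE)
xmai_to_xmai = ligate_sites(XMAI_SITE, XMAI_SITE)        # control: site regenerated
```

Because the hybrid junction is resistant to both enzymes, the enlarged insert keeps a single BsrFI site at one end and a single XmaI site at the other, so each round of cutting and ligation can double the repeat number.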

4. Subcloning of Synthetic Gum Repeats into pUC ss-EGFP Plasmid

The gum genes (gum-8, gum-20, and gum-32) were removed from MpUC18 plasmid as BspEI/SacI fragments and subcloned into pUC ss-EGFP plasmid behind the signal sequence. During this subcloning, EGFP was removed from pUC ss-EGFP as an XmaI/SacI fragment. XmaI and BspEI restriction sites are compatible but nonregenerable.

The next subcloning was done to put the EGFP gene in frame behind the gum sequences. pUC ss-EGFP plasmid was cut with XmaI and treated with Mung Bean endonuclease (New England Biolabs). The enzymes were inactivated by phenol/chloroform extraction followed by ethanol precipitation. The plasmid was then cut with SacI. The EGFP fragment isolated after restriction was subcloned into pUC ss-gum plasmids which were cut with SmaI/SacI restriction enzymes. The signal sequence-synthetic gene-EGFP fragments were removed from MpUC18 plasmid as BamHI/SacI fragments and subcloned into pBI121, replacing the β-glucuronidase reporter gene. The MpUC ss-gum₂₀-EGFP and MpUC ss-gum₃₂-EGFP plasmids were sequenced with pUC/M13 forward (17 mer) primer and with GFP primer GAAGATGGTGCGCTCCTGGACGT (SEQ ID NO:226) from nucleotide 566 to nucleotide 588 of pEGFP.

5. Transformation of Tobacco Cultured Cells, Tobacco Leaf Discs, and Tomato Cultured Cells, and Expression of Multiple GAGP Internal Repeats

The expression vectors contained an extensin signal sequence or a tomato signal sequence for transport of the constructs through the ER/Golgi for posttranslational modification, as well as Green Fluorescent Protein (GFP) as a reporter protein as described below.

A. Extensin Signal Sequence

Transformation vectors were derived from pBI121 (Clontech). These vectors contained an extensin signal sequence (SS) as well as Green Fluorescent Protein (GFP) as a reporter protein. 8, 20, 32, and 64 internal repeats of GAGP were inserted between the signal sequence and GFP to yield plasmids SS-GAGP₈-EGFP, SS-GAGP₂₀-EGFP, SS-GAGP₃₂-EGFP, and SS-GAGP₆₄-EGFP, respectively. Because preliminary data showed that the gene encoding the 64 repeats of GAGP was unstable in pBI121, plasmids SS-GAGP₈-EGFP, SS-GAGP₂₀-EGFP, and SS-GAGP₃₂-EGFP were used to transform Agrobacterium tumefaciens as described supra (Example 9).

B. Tomato LeAGP-1 Signal Sequence

As an alternative to the extensin signal sequence, the tomato LeAGP-1 signal sequence was used. The LeAGP-1 signal sequence was cloned using the sense primer 5′-CTC TTT TTC TCT G^(↓)GA TCC GGT CTA TAT TTT CTT TTA GC-3′ (SEQ ID NO:227) (Tm: 68° C.), with the arrow showing the BamHI restriction site, and the antisense primer 5′-CGG GTG CTG C^(↓)CC GGG TTG TCT GAC CCG TGA CAC TTG C-3′ (SEQ ID NO:228) (Tm: 80° C.), with the arrow showing the XmaI restriction site.
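The restriction sites engineered into the two primers can be located programmatically. The sketch below searches each primer for its site (GGATCC for BamHI, CCCGGG for XmaI) and reports the cut position marked by the arrow in the text, which falls after the first base of each site; the helper name is hypothetical.

```python
SENSE_PRIMER = "CTCTTTTTCTCTGGATCCGGTCTATATTTTCTTTTAGC"      # SEQ ID NO:227
ANTISENSE_PRIMER = "CGGGTGCTGCCCGGGTTGTCTGACCCGTGACACTTGC"   # SEQ ID NO:228

BAMHI_SITE = "GGATCC"   # arrow in the text: G^GATCC
XMAI_SITE = "CCCGGG"    # arrow in the text: C^CCGGG


def cut_position(primer, site, offset=1):
    """0-based index of the first base after the cut, or None if the site is absent."""
    start = primer.find(site)
    return None if start < 0 else start + offset


bamhi_cut = cut_position(SENSE_PRIMER, BAMHI_SITE)
xmai_cut = cut_position(ANTISENSE_PRIMER, XMAI_SITE)

# Bases left upstream of each cut; these are removed from the PCR product on digestion
sense_overhang_tail = SENSE_PRIMER[:bamhi_cut]
antisense_overhang_tail = ANTISENSE_PRIMER[:xmai_cut]
```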

PCR was carried out using 52.8 pmol of sense primer and 47 pmol of antisense primer. The LeAGP-1 signal sequence template (0.01 μg) was added together with the PCR mixture. The reaction solution was covered with oil and the incubation was at 95° C. 5 min (cycle 1); 95° C. 45 sec, 58° C. 1 min, 74° C. 1 min (cycles 2-30); 74° C. 5 min. 20 μl out of 50 μl total PCR solution was removed and purified on a 2% agarose gel. The PCR product was 127 bp in size and was isolated using the QIAEXII kit. This fragment was digested as follows at 37° C. overnight:

Purified PCR fragment    100 ng        pUC-SS^(Tob)GFP      200 ng
BamHI                    5 u           BamHI                2 u
XmaI                     4 u           XmaI                 2 u
Buffer B                 3 μl          Buffer B             3 μl
Add water to             30 μl         Add water to         30 μl

The digested samples were run on an agarose gel. The vector and fragment were cut from the gel and were isolated with the QIAEXII kit. The ligation reaction [pUC-SS^(Tob)GFP(BX) 100 ng, PCR fragment (BX) 20 ng, Ligase Buffer (10×) 1 μl, Ligase 1 μl] was incubated at 10° C. overnight.

Transformation was carried out and 3 clones were cultured separately in LB media containing ampicillin overnight. Plasmids were isolated from the transformed cells and digested with BamHI and XmaI to confirm that the fragments were 99 bp long. The plasmid containing the tomato signal sequence was named pUC-SS^(Tom)-GFP.

Plasmids containing the tomato signal sequence in tandem with repeating GAGP sequences and with EGFP as a reporter gene are used to transform Agrobacterium tumefaciens as described supra (Example 9).

The transformed Agrobacterium cells were used to transform tobacco cultured cells as described above (Example 10). Transformed cells were selected by detection of fluorescent cells which express GFP.

Transformed Agrobacterium cells will be used to transform tomato cultured cells and tobacco discs as described above (Examples 11 and 17, supra). Transformed cells will be selected by detection of fluorescent cells which express GFP. Successful expression of 8, 20, and 32 internal repeats of GAGP in tobacco cultured cells, tobacco leaf discs, and tomato cultured cells will be confirmed using the methods described in the above Examples.

Example 20 Construction of Genes and Vectors Containing Contiguous andNoncontiguous Hydroxyproline Glycomodules (SP)₃₂, (GAGP)₃, (SPP)₂₄,(SPPP)₁₆, and (SPPPP)₁₈

This Example describes construction of three plasmids, each encoding atobacco signal sequence and EGFP, as well as subcloning of (SP)₃₂,(GAGP)₃, EGFP, (SPP)₂₄, (SPPP)₁₆, (SPPPP)₁₈. In the three plasmidsdescribed here, the signal sequence was used to direct the productsthrough the ER and Golgi, then out to the extracellular matrix[Goodenough et al. (1986) J. Cell Biol. 103, 403; Gardiner & Chrispeels(1975) Plant Physiol. 55, 536-541]. Two of the plasmids also contained asynthetic gene (SEQ ID NOs:112, 113, 115, 116) encoding either six(Ser-Pro) internal repeat units (SEQ ID NO:117) or three (GAGP) internalrepeat units (SEQ ID NO:122) (FIG. 11) sandwiched between the signalsequence and gene-enhanced green fluorescent protein (EGFP). In FIG. 11,internal repeat oligonucleotide sets encoding Ser-Pro repeats or theGAGP sequence were polymerized head-to-tail in the presence of the5′-linker set [SEQ ID NOs:120 and 121 which encode SEQ ID NO:122].Following ligation, the 3′-linker [SEQ ID NOs:123 and 124 which encodeSEQ ID NO:125] was added and the genes then restricted with BamHI andEcoRI and inserted into pBluescript II SK. The signal sequence (SEQ IDNOs:118 and 119) was built by primer extension of the overlappingoligonucleotides featured here. The overlap is underlined.

The conserved (Ser-Hyp)_(n) motif was chosen because it occurs both in green algae (Chlamydomonas) and in higher plant AGPs. This noncontiguous Hyp motif is of particular interest because it also occurs together with a contiguous Hyp motif in the consensus sequence of GAGP, which contains both oligoarabinoside and polysaccharide addition sites.

The signal sequence (FIG. 11) was modeled after an extensin signal sequence from Nicotiana plumbaginifolia; mutually priming oligonucleotides were extended by T7 DNA Polymerase and the duplex placed in pUC18 as a BamHI-SstI fragment. Construction of a given synthetic gene involved the polymerization of three sets of partially overlapping, complementary oligonucleotide pairs as described earlier (FIG. 11). The following subclonings were required to create DNA fragments/restriction sites which allowed facile transfer of the signal sequence-synthetic gene-enhanced green fluorescent protein (EGFP) unit to the plant transformation vector pBI121 (Clontech). The synthetic genes were placed in pBluescript II SK (Stratagene) as BamHI-EcoRI fragments and then subcloned into pEGFP (Clontech) as BamHI-AgeI fragments preceding the EGFP gene (Tsien, R. Y. (1998) Annu. Rev. Biochem. 67, 509-544; Haseloff, J., Siemering, K. R., Prasher, D. C. & Hodge, S. (1997) Proc. Natl. Acad. Sci. 94, 2122-2127). The synthetic gene-EGFP fragments were then subcloned into pBluescript II KS (Stratagene) as XmaI/NotI fragments, removed as XmaI-SstI fragments and subcloned into pUC18 behind the signal sequence. DNA sequences were confirmed by sequence analysis before insertion into pBI121 as BamHI/SstI fragments, replacing the β-glucuronidase reporter gene. All constructs were under the control of the 35S cauliflower mosaic virus promoter. The oligonucleotides were synthesized by Lifesciences (Gibco/BRL). An Ala for Pro/Hyp substitution at residue 8 of the gum arabic glycoprotein (GAGP) internal repeat module (SEQ ID NO:208) (Ser-Pro-Ser-Pro-Thr-Pro-Thr-Pro-Pro-Pro-Gly-Pro-His-Ser-Pro-Pro-Pro-Thr-Leu) was inadvertently introduced during synthesis by a G for C base substitution in the sense strand.

The following is a more detailed description of the protocol used to subclone (SP)₃₂, (GAGP)₃, EGFP, (SPP)₂₄, (SPPP)₁₆, and (SPPPP)₁₈. Briefly, everything was first built and sequenced in pUC18, then transferred as a block (i.e., signal sequence-synthetic gene-EGFP) to pBI121. The constructs in pBI121 were not sequenced. The pBI121 plasmids were used to transform Agrobacterium and the transformed Agrobacterium was used to transform the plant cells, as described infra in Example 21.

1. Synthesis of the Signal Sequence

The signal sequence was assembled by using mutually priming oligonucleotides [“Current Protocols in Molecular Biology” (1995) pages 8.2.8-8.2.10]. Oligonucleotides (0.2 nmol, 0.2 nmol) were annealed (5 min at 70° C. followed by 5 min at 40° C.) and extended by DNA polymerase I (Klenow) large fragment (Promega) (30 min at 37° C.). The reaction was stopped by heating for 10 min at 75° C. The resulting DNA fragment was cut with BamHI and SstI and placed in pUC18 plasmid. The plasmid was sequenced with the pUC/M13 forward (17 mer) primer.

2. Synthesis and Subcloning of Synthetic Genes

Oligonucleotides were synthesized and SDS-PAGE purified by Gibco-BRL or Integrated DNA Technologies Inc. They were dissolved in water at appropriate concentrations.

A. (SP)₃₂ and (GAGP)₃ Synthesis and Subcloning

i. Annealing Reaction

Oligonucleotide-pairs were combined in eppendorf tubes as follows:

a) 5.5 μl internal repeat sense oligonucleotide (0.5 nmol/μl)

5.5 μl internal repeat antisense oligonucleotide (0.5 nmol/μl)

11 μl T4 ligase 10× ligation buffer

b) 2 μl 5′-end sense linker (0.05 nmol/μl)

2 μl 5′-end antisense linker (0.05 nmol/μl)

1 μl water

5 μl T4 ligase 10× ligation buffer

c) 2 μl 3′-end sense linker (1 nmol/μl)

2 μl 3′-end antisense linker (1 nmol/μl)

1 μl water

5 μl T4 ligase 10× ligation buffer

All tubes were heated for 5 min at 90-95° C. They were then cooled to 45° C. over the next 3 hours and kept at 45° C. for 2 more hours.

ii. Oligonucleotide Polymerization

10 μl of the internal repeat pair was combined with 10 μl of the 5′-end linker pair (15:1 molar ratio). This mixture was incubated for 3 hours at 17° C. Then, 80 μl of water (to bring the ligation buffer to 1× concentration) and 2 μl of T4 DNA ligase (4,000 U) were added. The ligation reaction was incubated for 36 hours at 12-15° C. The extent of polymerization was checked on a 2.2% agarose gel.

The 5′-end linker-internal repeat polymers were capped with the 3′-end linker. 5 μl of the 3′-end linker was added to 50 μl of the ligation reaction from the step above. The mixture was heated to 30° C. (to disrupt nonspecific hybridization) and incubated at 17° C. for 3 hours. 20 μl of water and 2 μl of T4 DNA ligase (4,000 U) were added and the ligation reaction was incubated at 12-15° C. for 36 hours. The reaction was stopped by heating at 65° C. for 10 min.
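The dilution in the polymerization step above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming that volumes are additive, that each tube is sampled as a homogeneous mixture, and that every sense strand forms one duplex; the helper name `carried_over` is illustrative only. Under these assumptions the buffer comes out at about 1×, and the nominal duplex ratio computed from the listed volumes is about 12.5:1, the same order as the stated 15:1.

```python
def carried_over(aliquot_ul, tube_total_ul, component_ul):
    """Volume of one component carried over when an aliquot is taken from a tube."""
    return component_ul * aliquot_ul / tube_total_ul

# Tube a): 5.5 + 5.5 ul oligonucleotides (0.5 nmol/ul each) + 11 ul 10x buffer
repeat_tube_ul = 5.5 + 5.5 + 11
# Tube b): 2 + 2 ul linkers (0.05 nmol/ul each) + 1 ul water + 5 ul 10x buffer
linker_tube_ul = 2 + 2 + 1 + 5

# 10 ul is taken from each tube, then 80 ul water and 2 ul ligase are added.
buffer_10x_ul = carried_over(10, repeat_tube_ul, 11) + carried_over(10, linker_tube_ul, 5)
final_volume_ul = 10 + 10 + 80 + 2
buffer_fold = 10 * buffer_10x_ul / final_volume_ul

# Duplex amounts, assuming one duplex per sense strand.
repeat_duplex_nmol = carried_over(10, repeat_tube_ul, 5.5) * 0.5
linker_duplex_nmol = carried_over(10, linker_tube_ul, 2) * 0.05
molar_ratio = repeat_duplex_nmol / linker_duplex_nmol

print(f"ligation buffer: {buffer_fold:.2f}x")
print(f"internal repeat : 5'-linker = {molar_ratio:.1f} : 1")
```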

The constructs were ethanol precipitated, washed with 70% ethanol and air dried. The pellet was dissolved in 80 μl of water. 10 μl was used for restriction with EcoRI (10 Units) and BamHI (20 Units). A Sephacryl S-400 column (Pharmacia Microspin™) was used to remove salts and small oligonucleotide fragments. The Qiaquick Nucleotide Removal Kit (Qiagen) was used to remove enzymes. The resultant fragments were inserted in pBluescript II SK plasmid (Stratagene). Clones were selected by blue-white assay. The structure of the synthetic genes was checked by sequencing with the pUC/M13 forward (17 mer) primer.

iii. Subcloning

The synthetic genes were first removed from pBluescript II SK (Stratagene) as BamHI/AgeI fragments and subcloned in pEGFP (Clontech). (This step allowed directional cloning.) The synthetic gene-EGFP fragments were removed from pEGFP as XmaI/NotI fragments and subcloned in pBluescript II KS (Stratagene). (This step was done to obtain an SstI site at the end of EGFP.) The synthetic gene-EGFP fragments were removed from KS as XmaI/SstI fragments and subcloned in the pUC-signal sequence plasmid behind the signal sequence. The structure of the synthetic genes was checked by sequencing with the pUC/M13 forward (17 mer) primer. The signal sequence-synthetic gene-EGFP fragments were removed from pUC18 plasmid as BamHI/SstI fragments and subcloned in pBI121 (Clontech).

iv. EGFP Subcloning

The EGFP fragment was removed from pEGFP as an XmaI/NotI fragment and subcloned in KS. (This step was done to obtain an SstI site at the end of EGFP.) The EGFP fragment was removed from KS as an XmaI/SstI fragment and subcloned in the pUC-signal sequence plasmid behind the signal sequence. The signal sequence-EGFP fragment was removed from pUC18 plasmid as a BamHI/SstI fragment and subcloned in pBI121.

B. (SPP)₂₄, (SPPP)₁₆, (SPPPP)₁₈, Palindromic Repeat Synthesis and Subcloning

i. Annealing Reaction

Oligonucleotide-pairs were combined in eppendorf tubes as follows:

a) 2 μl internal repeat sense oligonucleotide (0.25 nmol/μl)

2 μl internal repeat antisense oligonucleotide (0.25 nmol/μl)

3 μl T4 ligase 10× ligation buffer

23 μl water

b) 1 μl 5′-end sense linker (0.5 nmol/μl)

1 μl 5′-end antisense linker (0.5 nmol/μl)

4 μl T4 ligase 10× ligation buffer

34 μl water

c) 2 μl 3′-end sense linker (0.25 nmol/μl)

2 μl 3′-end antisense linker (0.25 nmol/μl)

3 μl T4 ligase 10× ligation buffer

23 μl water

All tubes were heated to 90-95° C. for 5 min. Then they were cooled to annealing temperature ( ) over the next 30 min and kept at that temperature for 1 more hour.

ii. Oligonucleotide Polymerization

25 μl of the internal repeat pair was combined with 20 μl of the 5′-end linker pair (1.5:1 ratio). The mixture was heated to 35° C. to disrupt circular structures formed by the internal repeat pair. After cooling to 20° C., 0.5 μl of T4 DNA ligase (1.5 U) was added. The ligation reaction was incubated for 3 hours at 20° C. 3 μl of the ligation mixture was used to check the extent of polymerization on a 2% agarose gel.

The 5′-end linker-internal repeat polymers were capped with the 3′-end linker. 15 μl of the 3′-end linker and 0.5 μl of T4 DNA ligase (1.5 U) were added to 40 μl of the ligation reaction from the step above. The ligation reaction was incubated for 3 hours at 20° C. The reaction was stopped by heating at 65° C. for 10 min. 3 μl of the ligation mixture was used to check the extent of polymerization on a 2% agarose gel. A Sephacryl S-200 column (Pharmacia Microspin™) was used to remove salts. 4-6 μl of solution was used for restriction with EcoRI (10 Units) and BamHI (20 Units). After restriction, 150-bp to 500-bp fragments were cut out of a 2% agarose gel. The QIAEX II gel extraction kit was used to isolate fragments from the gel.

The resultant fragments were inserted in pUC18 plasmid. Clones were selected by blue-white assay. The structure of the synthetic genes was checked by sequencing with the pUC/M13 forward (17 mer) primer.

iii. Subcloning

The synthetic genes were removed from pUC18 as XmaI/NcoI fragments and subcloned behind the signal sequence and in front of EGFP in the pUC-signal sequence-EGFP plasmid. The signal sequence-synthetic gene-EGFP fragments were removed from pUC18 plasmid as BamHI/SstI fragments and subcloned in pBI121.

The above protocols yielded pBI121 expression constructs in which genes encoding each of (SP)₃₂, (GAGP)₃, EGFP, (SPP)₂₄, (SPPP)₁₆, and (SPPPP)₁₈ palindromic repeats were ligated to sequences encoding the signal sequence and EGFP.

Example 21 Transformation of Tobacco Cells and Selection of Transformed Cell Lines

This Example describes transformation of suspension cultured tobacco cells with the expression vectors of Example 20 and selection of transformants which express green fluorescent protein.

Suspension cultured tobacco cells (Nicotiana tabacum, BY2) were transformed with Agrobacterium tumefaciens strain LBA4404 containing the pBI121-derived plant transformation vector. Transformed cell lines were selected on solid Murashige-Skoog medium (Sigma #5524) containing 100 μg/mL kanamycin. Timentin was initially included at 400 μg/mL to kill Agrobacterium. Cells were later grown in 1 L flasks containing 500 mL Schenk-Hildebrandt medium (Sigma #6765) and 100 μg/mL kanamycin, rotated at 100 rpm on a gyrotary shaker.

After transformation of tobacco cells with Agrobacterium harboring the plant transformation plasmid pBI121 outfitted with either Sig-(GAGP)₃-EGFP, Sig-(Ser-Pro)₃₂-EGFP, or Sig-EGFP (described in Example 20), selection on solid medium and subsequent growth in liquid culture yielded cells bathed in a green fluorescent medium. The fluorescence in these highly vacuolated, cultured cells surrounds the nuclei but is not within them, judging by optical sections (not shown). The microscope was a Molecular Dynamics Sarastro 2000 confocal laser scanning microscope using a 488 nm laser wavelength filter, 510 nm primary beam splitter and a 510 nm barrier filter.

This Example demonstrates that inclusion of the EGFP reporter protein facilitated the selection of transformed cells and subsequent detection of the expression products during isolation (FIGS. 13 & 14). EGFP fluorescence in the growth medium was also a visual demonstration of Sig efficacy in directing secretion. The absence of any obvious cell lysis in the cultures and excellent product yields of the glycosylated expression products confirmed that the green fluorescence represented bona fide secretory products. Interestingly, EGFP without a glycomodule was secreted at very low levels, perhaps due to lower solubility.

Example 22 Isolation of (Ser-Hyp)₃₂-EGFP, (GAGP)₃-EGFP, (SPP)₂₄-EGFP, (SPPP)₁₆-EGFP, and (SPPPP)₁₈-EGFP from Transformed Cells

This Example describes the isolation of sequences containing contiguous and noncontiguous Hyp residues from the growth medium of tobacco cells transformed with expression vectors which express these polypeptides.

Culture medium of cells described in Example 21, supra, was harvested 7 to 21 days after subculture, and the gene products were purified by gel permeation and reverse-phase chromatography (FIGS. 13 and 14) as follows. Culture medium was concentrated ten-fold by rotary evaporation, then injected onto a Superose-12 gel filtration column (Pharmacia) equilibrated in 200 mM sodium phosphate buffer, pH 7, and eluted at a flow rate of 1 mL/min. EGFP fluorescence was monitored by a Hewlett-Packard 1100 Series flow-through fluorometer (Excitation=488 nm; Emission=520 nm). The Superose-12 column was calibrated with molecular weight standards (BSA, insulin, catalase, and sodium azide). Fluorescent Superose-12 fractions were injected directly onto a Hamilton PRP-1 reverse-phase column and gradient eluted at a flow rate of 0.5 mL/min. Start buffer consisted of 0.1% TFA (aq) and elution buffer was 0.1% TFA/80% acetonitrile (aq). The sample was repeatedly injected (0.5 mL/minute) onto the column over 35 min, then eluted with a gradient of elution buffer (0-70%/135 min). Native GAGP was isolated from gum arabic nodules as described by Qi et al. Endogenous tobacco AGPs were isolated by PRP-1 reverse-phase chromatography and the results are shown in FIG. 13. FIG. 13 shows PRP-1 reverse-phase fractionation of the Superose-12 peaks containing (A) (Ser-Hyp)₃₂-EGFP, (B) (GAGP)₃-EGFP, and (C) glycoproteins in the medium of non-transformed tobacco cells. Endogenous tobacco AGPs eluted between 47 and 63 minutes; extensins eluted at ~67 min. For panel (C), control medium collected from non-transformed tobacco cells was first fractionated on Superose-12, and the fractions eluting between 47 and 63 min were collected for further separation on PRP-1 to determine whether any endogenous AGPs/HRGPs co-chromatographed with (Ser-Hyp)₃₂-EGFP or with (GAGP)₃-EGFP; they did not.

Six cell lines examined [three each of (Ser-Hyp)₃₂-EGFP and (GAGP)₃-EGFP] synthesized fluorescent glycoproteins of comparable sizes, although product yields between lines differed as much as ten-fold. For product characterization, high-yielding lines were chosen which typically produced 23 mg/L of (Ser-Hyp)₃₂-EGFP and 8 mg/L of (GAGP)₃-EGFP after isolation.

FIG. 12 shows Superose-12 gel permeation chromatography with fluorescence detection of (A) culture medium containing (Ser-Hyp)₃₂-EGFP, (B) (GAGP)₃-EGFP medium concentrated four-fold, (C) medium of EGFP targeted to the extracellular matrix (concentrated ten-fold), and (D) 10 mg standard EGFP from Clontech. Not shown is the fractionation of medium from non-transformed tobacco cells, which gave no fluorescent peaks, consistent with the results discussed above. Superose-12 fractionation of the two fusion glycoproteins (FIG. 12) compared to molecular weight standards (not shown) indicated mass ranges of ~95-115 kD for (Ser-Hyp)₃₂-EGFP and ~70-100 kD for (GAGP)₃-EGFP. The above data demonstrate successful isolation of GAGP sequences from cells which had been transformed with vectors that are capable of expressing these sequences.

The recombinant (SPP)₂₄-EGFP, (SPPP)₁₆-EGFP, and (SPPPP)₁₈-EGFP were isolated from transformed cells as described supra in this Example with respect to (SP)₃₂-EGFP and (GAGP)₃-EGFP.

Example 23 Characterization of Glycoproteins Isolated from Transformed Cells

The glycoproteins isolated from transformed tobacco cells as described in Example 22 were characterized as follows, and were shown to be new arabinogalactan-proteins (AGPs).

1. Co-Precipitation with Yariv Reagent

(Ser-Hyp)₃₂-EGFP, (GAGP)₃-EGFP, tobacco AGPs, and native GAGP were co-precipitated with the Yariv reagent as described earlier. Both (Ser-Hyp)₃₂-EGFP and (GAGP)₃-EGFP precipitated with Yariv reagent (Table 10), which is a specific property of β-1,3-linked arabinogalactan-proteins.

TABLE 10 Yariv Assay of (Ser-Hyp)₃₂-EGFP and (GAGP)₃-EGFP
Absorbances at 420 nm
                                                        Standards
Sample Weight (μg)   (Ser-Hyp)₃₂-EGFP   (GAGP)₃-EGFP   GAGP   Tobacco AGP
20                   0.16               0.27           0.51   0.16
50                   0.45               0.56           1.22   0.38
100                  1.00               1.21           2.69   0.85
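The relative Yariv response of each sample in Table 10 can be summarized as absorbance per microgram. The sketch below is illustrative only: it assumes a linear response through the origin (a least-squares slope with no intercept), which is a simplification not stated in the text, and the function name is hypothetical.

```python
# Absorbance at 420 nm from Table 10, keyed by sample weight in micrograms.
TABLE_10 = {
    "(Ser-Hyp)32-EGFP": {20: 0.16, 50: 0.45, 100: 1.00},
    "(GAGP)3-EGFP": {20: 0.27, 50: 0.56, 100: 1.21},
    "GAGP standard": {20: 0.51, 50: 1.22, 100: 2.69},
    "Tobacco AGP standard": {20: 0.16, 50: 0.38, 100: 0.85},
}

def slope_through_origin(points):
    """Least-squares slope for y = m*x (assumed model, no intercept)."""
    sxy = sum(x * y for x, y in points.items())
    sxx = sum(x * x for x in points)
    return sxy / sxx

slopes = {name: slope_through_origin(pts) for name, pts in TABLE_10.items()}
reference = slopes["GAGP standard"]

for name, m in slopes.items():
    print(f"{name:22s} {m:.4f} A420/ug  ({m / reference:.2f} x native GAGP)")
```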

2. Hydroxyproline Glycoside Profiles

Hyp-glycoside profiles were determined as described by Lamport and Miller. We hydrolyzed 5.8-12.2 mg (Ser-Hyp)₃₂-EGFP or (GAGP)₃-EGFP in 0.44 N NaOH and neutralized the hydrolysate with 0.3 M HCl before injection onto a C2 cation exchange column. Each Hyp residue in (Ser-Hyp)₃₂-EGFP contained an arabinogalactan-polysaccharide substituent; (GAGP)₃-EGFP Hyp residues contained arabinooligosaccharide substituents in addition to arabinogalactan-polysaccharide (Table 11).

TABLE 11 Hyp-Glycoside Profiles of (Ser-Hyp)₃₂-EGFP and (GAGP)₃-EGFP and Native Crude GAGP
% of Total Hyp
Hyp-Glycoside          (Ser-Hyp)₃₂-EGFP   GAGP₃-EGFP   Native GAGP
Hyp-polysaccharide     100                62           25
Hyp-Ara                0                  4            10
Hyp-Ara₂               0                  12           17
Hyp-Ara₃               0                  7            31
Hyp-Ara₄               0                  4            5
Non-glycosylated Hyp   0                  11           12

The Hyp-glycoside profile of (Ser-Hyp)₃₂-EGFP gave a single peak of Hyp corresponding to Hyp-polysaccharide. Significantly, peaks corresponding to Hyp-arabinosides and non-glycosylated Hyp were absent. Importantly, this indicates that all of the Hyp residues in the glycomodule were linked to a polysaccharide.
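The profiles in Table 11 can be tallied to confirm that each column accounts for 100% of total Hyp and to compare the share of Hyp carrying arabinooligosaccharides. The following is a small illustrative sketch; the dictionary layout and function name are not part of the original protocol.

```python
# Percent of total Hyp from Table 11; each list is ordered as in GLYCOSIDES.
GLYCOSIDES = ["Hyp-polysaccharide", "Hyp-Ara", "Hyp-Ara2", "Hyp-Ara3",
              "Hyp-Ara4", "Non-glycosylated Hyp"]
TABLE_11 = {
    "(Ser-Hyp)32-EGFP": [100, 0, 0, 0, 0, 0],
    "(GAGP)3-EGFP": [62, 4, 12, 7, 4, 11],
    "Native GAGP": [25, 10, 17, 31, 5, 12],
}

def summarize(percentages):
    """Return total, polysaccharide, arabinoside, and free-Hyp percentages."""
    profile = dict(zip(GLYCOSIDES, percentages))
    arabinosides = sum(v for k, v in profile.items() if k.startswith("Hyp-Ara"))
    return {
        "total": sum(percentages),
        "polysaccharide": profile["Hyp-polysaccharide"],
        "arabinosides": arabinosides,
        "non_glycosylated": profile["Non-glycosylated Hyp"],
    }

summary = {name: summarize(values) for name, values in TABLE_11.items()}
for name, s in summary.items():
    print(name, s)
```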

In contrast, (GAGP)₃-EGFP yielded peaks corresponding to Hyp-arabinosides, non-glycosylated Hyp, and Hyp-polysaccharide. However, (GAGP)₃-EGFP (FIGS. 11 & 15) was designed with fewer contiguous Hyp residues than the consensus sequence of native GAGP and yielded fewer Hyp arabinosides, consistent with fewer contiguous Hyp arabinosylation sites [Kieliszewski & Lamport (1994) Plant J. 5, 157-172; Kieliszewski et al. (1992) Plant Physiol. 98, 919-926; Kieliszewski et al. (1995) J. Biol. Chem. 270, 2541-2549]. In addition, occasional incomplete hydroxylation of the middle proline residue in the Pro-Pro-Pro motif (FIG. 14B) converted a region of contiguous Hyp (putative arabinosylation site) to noncontiguous Hyp (polysaccharide addition sites). Control EGFP targeted to the extracellular matrix contained no Hyp, hence no glycosylated Hyp, judging by manual Hyp assays.

The following describes the sequences of the genes and the expressed proteins, as well as the Hyp-glycoside profiles, which were obtained using the SPP and SPPP modules described in Table 4, as well as the SPPPP module.

A. Ser-Pro-Pro Gene

The [SPP]_(n) module described in Table 4, item 2.a. was expressed using the following sequence:

GGA TCC GCA ATG GGA AAA ATG GCT TCT CTA TTT GCC ACA TTT TTA GTG GTT TTA
G   S   A   M   G   K   M   A   S   L   F   A   T   F   L   V   V   L
GTG TCA CTT AGC TTA GCA CAA ACA ACC CGG GCC [CCA CCT TCA CCC CCA TCT CCA
V   S   L   V   L   A   Q   T   T   R   A   [P   P   S   P   P   S   P
CCG AGT CCA CCA TCC]₆ CCA CCT TCA TCC ATG GCA TAA TAG AGC TCG (SEQ ID NO: 229)
P   S   P   P   S  ]₆ P   S   S   M   A   Stop Stop (SEQ ID NO: 230)

The Ser-Pro-Pro gene expressed the protein sequence [Pro-Hyp-Ser-Hyp-Hyp-Ser-Hyp-Hyp-Ser-Hyp-Hyp-Ser]₆ (SEQ ID NO:231), which had the following Hyp-glycoside profile: Hyp (51% of total Hyp), Hyp-Ara (0% of total Hyp), Hyp-Ara₂ (0% of total Hyp), Hyp-Ara₃ (49% of total Hyp), Hyp-Ara₄ (0% of total Hyp), Hyp-Polysaccharide (0% of total Hyp).

B. Ser-Pro-Pro-Pro Gene

The [SPPP]_(n) module described in Table 4, item b. was expressed using the following sequence:

GGA TCC TCA ACC CGG GCC TCA CCA [CCA CCA CCT TCT CCA CCT CCA TCA CCC CCA
G   S   S   T   R   A   S   P   [P   P   P   S   P   P   P   S   P   P
CCT TCG CCT CCA CCA TCC]₄ CCT TCC ATG GCA TAA TAG AGC TCG AAT TCG (SEQ ID NO: 232)
P   S   P   P   P   S  ]₄ P   S   M   A   STOP STOP (SEQ ID NO: 233)

The expressed protein sequence had the following Hyp-glycoside profile: Hyp (0% of total Hyp), Hyp-Ara (0% of total Hyp), Hyp-Ara₂ (21% of total Hyp), Hyp-Ara₃ (39% of total Hyp), Hyp-Ara₄ (3% of total Hyp), Hyp-Polysaccharide (37% of total Hyp).

C. The Ser-Pro-Pro-Pro-Pro Gene

The [SPPPP]_(n) module was expressed using the following sequence:

GGA TCC TCA ACC CGG GCC TCA CCA [CCA CCA CCT TCA CCT CCA CCC CCA TCT
G   S   S   T   R   A   S   P   [P   P   P   S   P   P   P   P   S
CCA]₉ CCA CCA CCT TCC ATG GCA TTA TAG AGC TCG (SEQ ID NO: 234)
P  ]₉ P   P   P   S   M   A   Stop Stop (SEQ ID NO: 235)

The expressed protein sequence had the following Hyp-glycoside profile: Hyp (7% of total Hyp), Hyp-Ara (2% of total Hyp), Hyp-Ara₂ (8% of total Hyp), Hyp-Ara₃ (52% of total Hyp), Hyp-Ara₄ (31% of total Hyp), Hyp-Polysaccharide (0% of total Hyp).
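The bracketed repeat units of the three genes above can be checked against their listed translations with the standard genetic code. The following is a minimal sketch; the `translate` helper and constant names are illustrative, and only the bracketed repeat units (copied from the sequences above) are translated, not the flanking signal or linker regions.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA string (spaces allowed) in frame 1; '*' marks a stop."""
    seq = "".join(dna.split()).upper()
    if len(seq) % 3:
        raise ValueError("sequence length is not a multiple of three")
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))

# Bracketed repeat units from the SPP, SPPP, and SPPPP genes.
SPP_REPEAT = "CCA CCT TCA CCC CCA TCT CCA CCG AGT CCA CCA TCC"
SPPP_REPEAT = "CCA CCA CCT TCT CCA CCT CCA TCA CCC CCA CCT TCG CCT CCA CCA TCC"
SPPPP_REPEAT = "CCA CCA CCT TCA CCT CCA CCC CCA TCT CCA"

for name, unit in [("SPP", SPP_REPEAT), ("SPPP", SPPP_REPEAT), ("SPPPP", SPPPP_REPEAT)]:
    print(name, translate(unit))
```

Every codon in the three repeat units encodes only Pro or Ser, with synonymous codons varied between positions.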

3. Monosaccharide and Glycosyl Linkage Analysis

Monosaccharide compositions and linkage analyses were determined at the Complex Carbohydrate Research Center, University of Georgia as described earlier. The results are shown in Table 12.

TABLE 12 Glycosyl Compositions of (Ser-Hyp)₃₂-EGFP, (GAGP)₃-EGFP, Native GAGP and Crude Gum Arabic
Mol %
Glycosyl Residue   (Ser-Hyp)₃₂-EGFP   (GAGP)₃-EGFP^(a)   Native GAGP   Crude Gum Arabic
Ara                28                 23                 36            28
Gal                45                 49                 46            37
Rha                8                  8                  10            13
Xyl                0                  2                  0             0
GlcUA              19                 16                 9             17
Mann               1                  1                  0             0
^(a)Values corrected for a small amount of glucose contamination.

Gal and Ara accounted for the bulk of the saccharides in both fusion proteins, with lesser amounts of Rha and GlcUA (Table 12); saccharide accounted for 58% (dw) of (Ser-Hyp)₃₂-EGFP and 48% (dw) of (GAGP)₃-EGFP. Methylation analyses indicated that 3- and 3,6-linked galactose species accounted for 50 mole % of the sugars in (Ser-Hyp)₃₂-EGFP and 46 mole % of (GAGP)₃-EGFP; 2-linked arabinofuranose (Ara(f)) accounted for 1.6 and 3.1 mole % respectively; terminal Ara(f) accounted for 20 and 21 mole % respectively; 4-arabinopyranose or 5-Ara(f) accounted for 6 and 8% respectively; all rhamnose was terminal; and all GlcUA was 4-linked.

The sugar analysis data in Table 12 show that both fusion glycoproteins had sugar compositions typical of AGPs: a galactose:arabinose molar ratio of ~2:1 with lesser amounts of glucuronic acid and rhamnose. The predominantly 3- and 3,6-linked galactose and terminal arabinofuranose determined by methylation analysis were in keeping with a β-1,3-linked galactan backbone having sidechains of arabinose, glucuronic acid and rhamnose [Nothnagel, E. A. (1997) Int. Rev. Cytol. 174, 195-291]. The very low amount of 1,2-linked arabinose in (Ser-Hyp)₃₂-EGFP agreed with the absence of Hyp arabinosides, while the presence of 1,2-linked arabinose in (GAGP)₃-EGFP agreed with the presence of Hyp arabinosides in its Hyp glycoside profile, as they are known to be largely 1,2-linked [Sticher et al. (1993) Plant Physiol. 101, 1239-1247; Akiyama et al. (1980) Agric. Biol. Chem. 44, 2487-2489]. Thus, (GAGP)₃-EGFP contained both types of Hyp glycosylation, consistent with the presence of a polypeptide having contiguous and non-contiguous Hyp as putative arabinosylation and polysaccharide addition sites, respectively.

With respect to the size of attached polysaccharide, Hyp glycoside profiles showed the molar ratio of Hyp-polysaccharide in each fusion glycoprotein (Table 11). This gives the number of (polysaccharide)-Hyp residues in each glycoprotein molecule (e.g. Hyp-polysaccharide accounted for 100% of the Hyp glycosides in (Ser-Hyp)₃₂, i.e. 31-32 Hyp-polysaccharide). Glycoprotein size before and after deglycosylation gave an approximate size for the attached polysaccharide. The size of each fusion protein before and after deglycosylation was ~95-115 kDa and 34 kDa respectively for (Ser-Hyp)₃₂-EGFP (~71 kDa carbohydrate), and ~70-100 kDa and 34 kDa respectively for (GAGP)₃-EGFP (~51 kDa carbohydrate). Judging by the gene sequence (not shown) and FIG. 14, (Ser-Hyp)₃₂-EGFP contains ~31-32 Hyp residues, all noncontiguous, hence with an average polysaccharide size of 71 kDa/31=2.2-2.3 kDa, which corresponds to 14-15 sugar residues (average sugar residue weight of 155 calculated from the sugar composition in Table 12) and is consistent with the empirical formula Gal₆ Ara₃ GlcA₂ Rha based on compositional data in Table 12. Similarly, (GAGP)₃-EGFP contains ~23-25 Hyp residues, of which 62% (Table 11), or ~15, occur with polysaccharide attached. Hence the polysaccharide approximates 51 kDa/15=3.4 kDa, corresponding to about 22 sugar residues, a modest overestimate as it includes arabinose from the Hyp arabinooligosaccharides.

The similarity of these fusion glycoproteins to native GAGP (Table 12) suggests a model for the Hyp-polysaccharide based on the general arabinogalactan structure [Akiyama et al. (1980) Agric. Biol. Chem. 44, 2487-2489; Aspinall & Knebl (1986) Carbohyd. Res. 157, 257-260; Defaye & Wong (1986) Carbohydr. Res. 150, 221-231] of a galactan core with small sidechains containing rhamnose, arabinose and glucuronic acid. Possibly larger arabinogalactan polysaccharide can be built up by repeated addition [Clarke et al. (1979) Phytochem. 18, 521-540; Bacic et al. (1987) Carbohyd. Res. 162, 85-93] of small ~12 residue motifs represented by the above empirical formula.
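The size estimate above can be reproduced step by step. The sketch below assumes that the carbohydrate mass is taken as the midpoint of the glycoprotein size range minus the 34 kDa deglycosylated size (an assumption that reproduces the stated ~71 kDa and ~51 kDa figures) and uses the stated average residue weight of 155; the function name is illustrative.

```python
def polysaccharide_estimate(size_range_kda, deglycosylated_kda, n_sites, residue_da=155):
    """Estimate carbohydrate mass, mass per Hyp site, and residues per site."""
    midpoint = sum(size_range_kda) / 2           # assumed: midpoint of the range
    carbohydrate_kda = midpoint - deglycosylated_kda
    per_site_kda = carbohydrate_kda / n_sites
    residues_per_site = per_site_kda * 1000 / residue_da
    return carbohydrate_kda, per_site_kda, residues_per_site

# (Ser-Hyp)32-EGFP: ~95-115 kDa glycosylated, 34 kDa deglycosylated, ~31 sites
ser_hyp = polysaccharide_estimate((95, 115), 34, 31)
# (GAGP)3-EGFP: ~70-100 kDa glycosylated, 34 kDa deglycosylated, ~15 sites
gagp = polysaccharide_estimate((70, 100), 34, 15)

for name, (carb, per_site, residues) in [("(Ser-Hyp)32-EGFP", ser_hyp),
                                          ("(GAGP)3-EGFP", gagp)]:
    print(f"{name}: {carb:.0f} kDa carbohydrate, "
          f"{per_site:.2f} kDa per site, about {residues:.1f} residues")
```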

4. Hydroxyproline Assay of Secreted EGFP

Secreted EGFP, the product of the Sig-EGFP gene, was isolated by the Superose-12 fractionation. We removed EGFP from the fusion glycoproteins by overnight pronase digestion (1% ammonium bicarbonate, 5 mM CaCl₂; 27° C.; 1:100 enzyme:substrate ratio) followed by isolation of EGFP by gel permeation chromatography as described above. After dialysis and freeze-drying, we assayed Hyp on 0.5 mg EGFP as described earlier. There was no Hyp in secreted EGFP or in EGFP removed from the fusion glycoproteins by pronase.

5. Anhydrous Hydrogen Fluoride (HF) Deglycosylation

We deglycosylated 4.5 mg each of (Ser-Hyp)₃₂-EGFP and (GAGP)₃-EGFP in anhydrous HF containing 10% dry methanol for 1 hr at 0° C., then quenched the reactions in ddH₂O. After deglycosylation of 4.5 mg of each fusion glycoprotein, we recovered 1 mg of deglycosylated (Ser-Hyp)₃₂-EGFP (i.e. ~23% weight recovery) and 2.2 mg deglycosylated (GAGP)₃-EGFP (i.e. ~50% recovery).

6. Protein and DNA Sequence Analysis

Protein sequence analysis was performed at the Michigan State University Macromolecular Facility on a 477-A Applied Biosystems Inc. gas phase sequencer. DNA sequencing was performed at the Guelph Molecular Supercentre, University of Guelph, Ontario, Canada. Edman degradation confirmed the gene sequences and identified which Pro residues had been hydroxylated to Hyp. In particular, N-terminal sequencing of both (Ser-Hyp)₃₂-EGFP and (GAGP)₃-EGFP (FIG. 14) verified the synthetic gene sequences and identified hydroxyproline residues. Occasional incomplete proline hydroxylation has been observed elsewhere [de Blank et al. (1993) Plant Mol. Biol. 22, 1167-1171] and may simply signify a prolyl hydroxylase with less than 100% fidelity.

The above data demonstrate that the repetitive Ser-Hyp motif directed the exclusive addition of arabinogalactan polysaccharide to Hyp in (Ser-Hyp)₃₂-EGFP, while Hyp arabinosylation was correlated with the presence of contiguous Hyp motifs in (GAGP)₃-EGFP. Thus the O-Hyp glycosyltransferases of plants seem to resemble the O-Ser and O-Thr glycosyltransferases of animals in their multiplicity and ability to discriminate based on primary sequence and site clustering [Bacic et al. (1987) supra; Gerken et al. (1997) J. Biol. Chem. 272, 9709-9719].

Example 24 Assay of Emulsifying Activity and Emulsion Stabilizing Activity of GAGPs

This Example analyzes the emulsifying activity (EA) and emulsion stabilizing activity (ES) of recombinant (GAGP)₃-EGFP which was expressed in the medium of transformed tobacco cell cultures as described above (Example 23). These activities were compared with those for bovine serum albumin (BSA), crude gum arabic glycoprotein (crude GAGP) which was isolated from Acacia senegal, dialyzed gum arabic glycoprotein, and tobacco arabinogalactan-protein (AGP), which contains a mixture of at least four different arabinogalactan-proteins. In addition, this Example describes the emulsifying activity and emulsion stabilizing activity of (GAGP)₃-EGFP protein fractions which were fractionated on Superose-6 and reverse-phase columns (Example 23), as well as the effect of size and glycosylation of (GAGP)₃-EGFP on emulsifying activity and emulsion stabilizing activity. All GAGP emulsions used in Tables 14-17, infra, were prepared at a concentration of 0.5% (w/v).

The emulsifying activity and emulsion stabilizing activity were determined using orange oil (Sigma) following essentially the manufacturer's instructions. Freeze-dried glycoproteins were dissolved in 0.05 M phosphate buffer (pH 6.5) at a concentration of 0.5% (m/v). The aqueous solutions were combined with orange oil in a 60:40 (v/v) ratio. A 1 ml emulsion was prepared in a glass tube at 0° C. with a Sonic Dismembrator (Fisher Scientific) equipped with a Microtip probe. The amplitude value was set at 4 and mixing time was set to 1 min.

For the determination of emulsifying ability (EA), the emulsion was diluted serially with a solution containing 0.1 M NaCl and 0.1% SDS to give a final dilution of 1/1500. The optical density of the diluted emulsion was then determined in a 1-cm pathlength cuvette at a wavelength of 50 nm and defined as the emulsifying activity (EA). BSA was used as a positive control. Test samples which showed an emulsifying activity which was at least 10%, more preferably at least 50%, and most preferably at least 75% of the emulsifying activity of a BSA control are said to be “characterized by having emulsifying activity.”

For emulsifying stability, the emulsion was stored vertically in a glass tube for 3 h at room temperature, then the optical density of a 1:1500 dilution of the lower phase of the stored sample was measured. Emulsifying stability (ES) was defined as the percentage optical density remaining after 2 hours of storage. BSA was used as a positive control. Test samples which showed an emulsion stabilizing activity which was at least 10%, more preferably at least 50%, and most preferably at least 75% of the emulsion stabilizing activity of a BSA control are said to be “characterized by having emulsion stabilizing activity.”

To determine whether (GAGP)₃-EGFP had emulsifying activity and/or emulsion stabilizing activity, this glycoprotein was assayed as described above and its activities were compared with those for bovine serum albumin (BSA), crude gum arabic, dialyzed gum arabic, and tobacco AGP. The results are shown in Table 13, which demonstrates the emulsifying properties of native gum arabic when compared to BSA, the synthetic GAGP₃-EGFP, and native tobacco AGPs.

TABLE 13 Emulsion Properties of Crude Gum Arabic and Other Materials^(a)
Materials   BSA (0.5%)   Crude GAGP (0.5%)   Crude GAGP (1.0%)   Dialyzed GAGP (0.5%)   Synthetic GAGP^(b) (0.5%)   Tobacco AGP (0.5%)
EA          0.801        0.102               0.472               0.146                  0.007                       0.035
ES          90.6%        39.7%               83.0%               57.5%                  20.2%                       20.0%
^(a)Values in parentheses are the concentration (wt %).
^(b)Synthetic GAGP (i.e., GAGP₃-EGFP) was isolated from the medium of the recombinant tobacco cell culture. The fused GFP was knocked off by pronase digestion before emulsion property measurement.
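The 10%, 50%, and 75% thresholds defined above relative to the BSA control can be applied directly to the values in Table 13. The following sketch is illustrative; the tier labels and function name are not taken from the text, and the BSA column is used as the control for both EA and ES.

```python
# (EA, ES %) values from Table 13; BSA at 0.5% serves as the positive control.
TABLE_13 = {
    "BSA (0.5%)": (0.801, 90.6),
    "Crude GAGP (0.5%)": (0.102, 39.7),
    "Crude GAGP (1.0%)": (0.472, 83.0),
    "Dialyzed GAGP (0.5%)": (0.146, 57.5),
    "Synthetic GAGP (0.5%)": (0.007, 20.2),
    "Tobacco AGP (0.5%)": (0.035, 20.0),
}

def tier(value, control):
    """Classify a value by its fraction of the control (10/50/75% thresholds)."""
    fraction = value / control
    if fraction >= 0.75:
        return ">=75%"
    if fraction >= 0.50:
        return ">=50%"
    if fraction >= 0.10:
        return ">=10%"
    return "<10%"

bsa_ea, bsa_es = TABLE_13["BSA (0.5%)"]
tiers = {
    name: (tier(ea, bsa_ea), tier(es, bsa_es))
    for name, (ea, es) in TABLE_13.items()
}

for name, (ea_tier, es_tier) in tiers.items():
    print(f"{name:24s} EA {ea_tier:6s} ES {es_tier}")
```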

In addition, different (GAGP)₃-EGFP fractions which were obtained from Superose-6 column fractionation were also assayed and the results are shown in Table 14, which demonstrates that fraction F-2, which contained native GAGP, showed the highest emulsifying activity and emulsion stabilizing activity of all fractions tested. These results establish GAGP as the emulsifying component of gum arabic.

TABLE 14 Emulsion Properties of GAGP Protein Fractions Separated by Superose-6 Column
Fractions   F-1     F-2     F-3     F-4     F-5
EA          0.442   0.558   0.299   0.081   0.019
ES          74.1%   84.2%   48.5%   32.2%   22.4%

The F-2 fraction was further separated on a Hydrophobic Interaction column (HIC). The F-2 fraction was dissolved in 4.2 M NaCl and injected onto the HIC column. The column was eluted starting with 4.2 M NaCl, followed by 3.0 M NaCl, 2.0 M NaCl, 1.0 M NaCl, and distilled water. The resulting fractions were tested and the results are shown in Table 15, which demonstrates that F-2 contains GAGP which is characterized by having emulsifying activity and emulsion stabilizing activity. Table 15 also demonstrates that F-2 separates into four components which differ in hydrophobicity, with the 2.0 M and 1.0 M NaCl fractions being good emulsifiers.

TABLE 15 Emulsion Properties of F-2 Fractions Separated by Hydrophobic Interaction Column
Fractions   4.2M NaCl (1)   4.2M NaCl (2)   3.0M NaCl   2.0M NaCl   1.0M NaCl   Distilled water
EA          0.076           0.284           0.475       0.710       0.670       0.04
ES          28%             60.5%           78.5%       93.5%       94.6%       21.0%

In order to determine the effect of the size of GAGPs on their emulsion activity and emulsion stabilizing activity, the F-2 fraction containing native GAGP was incubated in 0.2 N NaOH at 50° C. for 0.5 hr, 1.0 hr, 2.0 hr, 4.0 hr, and 8.0 hr and the emulsifying properties of each sample were determined as shown in Table 16.

TABLE 16
Emulsion Properties of Partially-deglycosylated F-2 Samples

        0 hr    0.5 hr   1.0 hr   2.0 hr   4.0 hr   8.0 hr
EA      0.558   0.354    0.245    0.097    0.036    0.011
ES      84.2%   61.2%    41.5%    23.2%    0        0

The results in Table 16 demonstrate that both the emulsifying activity and emulsion stabilizing activity of GAGP decrease with decreasing GAGP size.
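The trend reported in Table 16 can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the described method: the numbers are transcribed from Table 16, while the helper names `is_non_increasing` and `percent_retained` are hypothetical and introduced only for illustration.

```python
# Values transcribed from Table 16 (F-2 incubated in 0.2 N NaOH at 50 deg C).
hours = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
ea = [0.558, 0.354, 0.245, 0.097, 0.036, 0.011]
es_percent = [84.2, 61.2, 41.5, 23.2, 0.0, 0.0]


def is_non_increasing(values):
    """Return True if no value is larger than the value before it."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


def percent_retained(values):
    """Express each value as a percentage of the first (0 hr) value."""
    baseline = values[0]
    return [round(100.0 * v / baseline, 1) for v in values]


ea_retained = percent_retained(ea)
es_retained = percent_retained(es_percent)

if __name__ == "__main__":
    print("EA non-increasing:", is_non_increasing(ea))
    print("ES non-increasing:", is_non_increasing(es_percent))
    for t, a, s in zip(hours, ea_retained, es_retained):
        print(f"{t:>4} hr  EA retained {a:5.1f}%  ES retained {s:5.1f}%")
```

Expressing each time point relative to the 0 hr sample makes the loss of activity comparable between EA (an absorbance-type value) and ES (a percentage).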

To determine whether the carbohydrate moiety of GAGPs affects their emulsion activity and emulsion stabilizing activity, the F-2 fraction was partially deglycosylated by anhydrous hydrogen fluoride (HF) as described above, and the emulsifying properties of the deglycosylated sample were determined. The deglycosylated F-2 fraction had an EA of 0.269 and an ES of 46.5%, compared with an EA of 0.558 and an ES of 84.2% for the untreated F-2 fraction (Table 14). These results demonstrate that the GAGP in the F-2 fraction lost much of its ability to emulsify, thus indicating the importance of the carbohydrate moiety of the GAGP for emulsification.
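The size of the loss on HF deglycosylation can be worked out from the reported values. The sketch below is illustrative only: the untreated F-2 values are taken from Table 14 and the deglycosylated values from the paragraph above, and the function name `percent_lost` is hypothetical.

```python
# Untreated F-2 values are from Table 14; deglycosylated values are from the text.
untreated = {"EA": 0.558, "ES": 84.2}
deglycosylated = {"EA": 0.269, "ES": 46.5}


def percent_lost(before, after):
    """Return the relative decrease from before to after, in percent."""
    return round(100.0 * (before - after) / before, 1)


# Relative loss of each emulsion property after HF treatment.
loss = {key: percent_lost(untreated[key], deglycosylated[key]) for key in untreated}

if __name__ == "__main__":
    for key, value in loss.items():
        print(f"{key}: {value}% lost after HF deglycosylation")
```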

From the above, it should be clear that the present invention provides a new approach and solution to the problem of producing plant gums. The approach is not dependent on environmental factors and greatly simplifies production of a variety of naturally-occurring gums, as well as designer gums.

1-16. (canceled)
17. A polynucleotide encoding a fusion polypeptide, the encoded fusion polypeptide comprising a first amino acid comprising Xaa-Hyp-Xaa-Hyp (SEQ ID NO: 9), wherein Xaa is an amino acid chosen from Ser and Ala, and a second amino acid comprising a fusion partner consisting of a protein which is not a gum arabic glycoprotein; wherein the Hyp residues undergo O-glycosylation exclusively with arabinogalactan polysaccharide when the fusion polypeptide is expressed in a plant cell.
18. A recombinant expression vector comprising the polynucleotide according to claim 17.
19. The expression vector of claim 18, further comprising a promoter operably linked to the polynucleotide.
20. The expression vector of claim 19, wherein the promoter is a viral promoter.
21. The expression vector of claim 20, wherein the viral promoter is selected from the group consisting of the 35S and 19S RNA promoters of cauliflower mosaic virus.
22. The expression vector of claim 19, wherein the polynucleotide further comprises a signal sequence selected from extensin signal sequence (SEQ ID NO: 14) and tomato arabinogalactan-protein signal sequence (SEQ ID NO: 215).
23. The expression vector of claim 22, wherein the encoded second amino acid is a reporter gene.
24. A method for producing a fusion polypeptide, comprising: a) providing: i) a recombinant expression vector according to claim 18; and ii) a host plant cell; and b) introducing said vector into said host plant cell under conditions such that the fusion polypeptide is expressed.
25. The method of claim 24, wherein the host plant cell is growing in culture.
26. The method of claim 25, further comprising the step of c) recovering the fusion polypeptide from the host plant cell culture.
27. The method of claim 24, wherein the plant cell is from the plant family Leguminosae.